Abstract. Finding a basis/coordinate system that can efficiently represent an input
data stream, viewed as realizations of a stochastic process, is of tremendous
importance in many fields including data compression and computational neuroscience.
Two popular measures of such efficiency of a basis are sparsity (measured by the
expected $\ell^p$ norm, $0 < p \le 1$) and statistical independence (measured by the mutual
information). Gaining deeper understanding of their intricate relationship, however,
remains elusive. Therefore, we chose to study a simple synthetic stochastic process
called the spike process, which puts a unit impulse at a random location in an
$n$-dimensional vector for each realization. For this process, we obtained the following
results: 1) The standard basis is the best both in terms of sparsity and statistical
independence if $n \ge 5$ and the search of basis is restricted within all possible orthonormal
bases in $\mathbb{R}^n$; 2) If we extend our basis search to all possible invertible linear
transformations in $\mathbb{R}^n$, then the best basis in statistical independence differs from the one
in sparsity; 3) In either of the above, the best basis in statistical independence is not
unique, and there even exist those which make the inputs completely dense; 4) There
is no linear invertible transformation that achieves the true statistical independence
for $n > 2$.

Key words and phrases: Sparse representation, statistical independence, data com- 
pression, basis dictionary, best basis, spike process 

1. Introduction 

What is a good coordinate system/basis to efficiently represent a given set of images? 
We view images as realizations of a certain complicated stochastic process whose proba- 
bility density function (pdf ) is not known a priori. Sparsity is important here since this is 
a measure of how well one can compress the data. A coordinate system producing a few 
large coefficients and many small coefficients has high sparsity for that data. The sparsity 
of images relative to a coordinate system is often measured by the expected $\ell^p$ norm of
the coefficients where $0 < p \le 1$. Statistical independence is also important since
statistically independent coordinates do not interfere with each other (no crosstalk, no error
propagation among them). The amount of statistical dependence of input images relative 
to a coordinate system is often measured by the so-called mutual information, which is a 
statistical distance between the true pdf and the product of the one-dimensional marginal
pdf's. 

Neuroscientists have become interested in efficient representations of images, in par- 
ticular, images of natural scenes such as trees, rivers, mountains, etc., since our visual 
system effortlessly reduces the amount of visual input data without losing the essential 
information contained in them. Therefore, if we can find what type of basis functions 
are sparsifying the input images or are providing us with the statistically independent 
representation of the inputs, then that may shed light on the mechanisms of our visual 
system. Olshausen and Field (1996, 1997) pioneered such studies using computational 
experiments emphasizing the sparsity. Immediately after their experiments, Bell and Se- 
jnowski (1997), van Hateren and van der Schaaf (1998) conducted similar studies using 
the statistical independence criterion. Surprisingly, these results suggest that both spar- 
sity and independence criteria tend to produce oriented Gabor-like functions, which are 
similar to the receptive field profiles of the neurons in our primary visual cortex. However, 
the relationship between these two criteria has not been understood completely. 

These experiments and observations inspired our study in this paper. We wish to 
deepen our understanding of this intricate relationship. Our goal here, however, is more 
modest in that we only study the so-called "spike" process, a simple synthetic stochastic 
process, which puts a unit impulse at a random location in an n-dimensional vector for 
each realization. It is important to use a simple stochastic process first since we can gain 
insights and make precise statements in terms of theorems. By these theorems, we now 
understand what are the precise conditions for the sparsity and statistical independence 
criteria to select the same basis for the spike process. In fact, we prove the following 
facts. 

• The standard basis is the best both in terms of sparsity and statistical independence
if $n \ge 5$ and the search of a basis is restricted within all possible orthonormal bases
in $\mathbb{R}^n$.

• If we extend our basis search to all possible invertible linear transformations in $\mathbb{R}^n$,
then the best basis in statistical independence differs from the standard basis, which
is the best in sparsity.

• In either of the above, the best basis in statistical independence is not unique, and
there even exist those which make the inputs completely dense.

• There is no linear invertible transformation that achieves the true statistical
independence for $n > 2$.

These results and observations hopefully lead to deeper understanding of the efficient 
representations of more complicated stochastic processes such as natural scene images. 
More information about other stochastic processes, such as the "ramp" process
(another simple yet important stochastic process), can be found in Saito et al. (2000, 2001),
other simple yet important stochastic process), can be found in Saito et al. (2000, 2001), 
which also contain our numerical experiments on natural scene images. 

The organization of this paper is as follows. In Section 2, we set our notations 
and terminology. Then in Section 3, we precisely define how to quantitatively measure 
the sparsity and statistical dependence of a stochastic process relative to a given basis. 
Using a very simple example, Section 4 demonstrates that the sparsity and statistical 
independence are two clearly different concepts. Section 5 presents our main results. We 
prove these theorems in Section 6 and Appendices. Finally, we discuss the implications 
and further directions in Section 7. 

2. Notations and Terminology 

Let us first set our notation and the terminology of basis dictionaries and best bases.
Let $X \in \mathbb{R}^n$ be a random vector with some unknown pdf $f_X$. Let us assume that
the available data $\mathcal{T} = \{x_1, \ldots, x_N\}$ were independently generated from this probability
model. The set $\mathcal{T}$ is often called the training dataset. Let $B = (w_1, \ldots, w_n) \in O(n)$
(the group of orthonormal transformations in $\mathbb{R}^n$) or $SL^{\pm}(n, \mathbb{R})$ (the group of invertible
volume-preserving transformations in $\mathbb{R}^n$, i.e., their determinants are $\pm 1$). The best-basis
paradigm (Coifman and Wickerhauser (1992), Wickerhauser (1994), Saito (2000)) is to
find a basis $B$ or a subset of basis vectors such that the features (expansion coefficients)
$Y = B^{-1} X$ are useful for the problem at hand (e.g., compression, modeling, discrimination,
regression, segmentation) in a computationally fast manner. Let $C(B \mid \mathcal{T})$ be a
numerical measure of deficiency or cost of the basis $B$ given the training dataset $\mathcal{T}$ for
the problem at hand. For very high-dimensional problems, we often restrict our search
within the basis dictionary $\mathcal{D} \subset SL^{\pm}(n, \mathbb{R})$, such as the orthonormal or biorthogonal
wavelet packet dictionaries or local cosine or Fourier dictionaries where we never need to
compute the full matrix-vector product or the matrix inverse for analysis and synthesis.
Under this setting, $B_\star = \arg\min_{B \in \mathcal{D}} C(B \mid \mathcal{T})$ is called the best basis relative to the cost
$C$ and the training dataset $\mathcal{T}$. Section 6.3 reviews the concept of the basis dictionary and
the best-basis algorithm in detail.

We also note that $\log$ in this paper means $\log_2$, unless stated otherwise.

3. Sparsity vs. Statistical Independence 

The concept of sparsity and that of statistical independence are intrinsically different. 
Sparsity emphasizes the issue of compression directly, whereas statistical independence 
concerns the relationship among the coordinates. Yet, for certain stochastic processes, 
these two are intimately related, and often confusing. For example, Olshausen and Field 
(1996, 1997) emphasized the sparsity as the basis selection criterion, but they also
assumed the statistical independence of the coordinates. Bell and Sejnowski (1997) used
the statistical independence criterion and obtained the basis functions similar to those of 
Olshausen and Field. They claimed that they did not impose the sparsity explicitly and 
such sparsity emerged by minimizing the statistical dependence among the coordinates. 
These motivated us to study these two criteria. 

First let us define the measure of sparsity and that of statistical independence in our 
context. 

3.1 Sparsity 

Sparsity is a key property of a good coordinate system for compression. The true
sparsity measure for a given vector $x \in \mathbb{R}^n$ is the so-called $\ell^0$ quasi-norm, which is defined
as

$$\|x\|_0 = \#\{ i \in [1, n] : x_i \neq 0 \},$$

i.e., the number of nonzero components in $x$. This measure is, however, very unstable
under even small perturbations of the components of a vector. Therefore, a better measure
is the $\ell^p$ norm:

$$\|x\|_p = \Bigl( \sum_{i=1}^{n} |x_i|^p \Bigr)^{1/p}, \quad 0 < p \le 1.$$

In fact, this is a quasi-norm for $0 < p < 1$ since it does not satisfy the triangle
inequality, but only satisfies weaker conditions: $\|x + y\|_p \le 2^{-1/p'} (\|x\|_p + \|y\|_p)$, where
$p'$ is the conjugate exponent of $p$; and $\|x + y\|_p^p \le \|x\|_p^p + \|y\|_p^p$. It is easy to show that
$\lim_{p \downarrow 0} \|x\|_p^p = \|x\|_0$. See Day (1940), Donoho (1994, 1998) for the details of the $\ell^p$ norm
properties.

Thus, we can use the expected $\ell^p$ norm minimization as a criterion to find the best
basis for a given stochastic process in terms of sparsity:

$$C_p(B \mid X) = E \|B^{-1} X\|_p^p. \tag{3.1}$$

The sample estimate of this cost given the training dataset $\mathcal{T}$ is

$$C_p(B \mid \mathcal{T}) = \frac{1}{N} \sum_{k=1}^{N} \|y_k\|_p^p = \frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{n} |y_{i,k}|^p, \tag{3.2}$$

where $y_k = (y_{1,k}, \ldots, y_{n,k})^T = B^{-1} x_k$ and $x_k$ is the $k$th sample (or realization) in $\mathcal{T}$. We
propose to use the minimization of this cost to select the best sparsifying basis (BSB):

$$B_p = B_p(\mathcal{T}, \mathcal{D}) = \arg\min_{B \in \mathcal{D}} C_p(B \mid \mathcal{T}).$$
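
The sample cost (3.2) can be sketched numerically as follows. This is only an illustration, not the authors' implementation: the function name `lp_cost`, the toy data (one copy of each unit spike), and the Sylvester-type construction of a Walsh-like orthonormal matrix are assumptions made for the example.

```python
import numpy as np

def lp_cost(B, X, p):
    """Sample estimate of C_p(B | T): average over samples of sum_i |y_i|^p.

    B : (n, n) invertible matrix whose columns are the basis vectors.
    X : (n, N) array whose columns are the training samples x_k.
    p : exponent with 0 < p <= 1.
    """
    Y = np.linalg.solve(B, X)  # y_k = B^{-1} x_k for every column of X
    return np.mean(np.sum(np.abs(Y) ** p, axis=0))

n = 8
X = np.eye(n)  # toy training set: each unit spike exactly once

# Standard basis: one nonzero coefficient per sample.
cost_identity = lp_cost(np.eye(n), X, p=0.5)

# Orthonormal matrix with entries +-1/sqrt(n) (Sylvester construction),
# which spreads every spike over all n coordinates.
W = np.array([[1.0]])
while W.shape[0] < n:
    W = np.block([[W, W], [W, -W]])
W /= np.sqrt(n)
cost_walsh = lp_cost(W, X, p=0.5)
```

Under these assumptions the standard basis gives the smaller cost, since each sample contributes a single coefficient of magnitude one, whereas the dense basis contributes $n$ coefficients of magnitude $n^{-1/2}$.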

Remark 1. It should be noted that the minimization of the $\ell^p$ norm can also be
achieved for each realization. Without taking averages in (3.2), one can select the BSB
$B_p = B_p(\{x_k\}, \mathcal{D})$ for each realization $x_k \in \mathcal{T}$. We can guarantee that

$$\min C_p(B \mid \{x_k\}) \le \min C_p(B \mid \mathcal{T}) \le \max C_p(B \mid \{x_k\}).$$
For highly variable or erratic stochastic processes, however, $B_p(\{x_k\}, \mathcal{D})$ may change
significantly for each $k$, and we need to store more information about this set of $N$ bases if we want
to use them to compress the entire training dataset. Whether we should adapt a basis
per realization or on the average is still an open issue. See Saito et al. (2000, 2001) for
more details.

3.2 Statistical Independence 

The statistical independence of the coordinates of $Y \in \mathbb{R}^n$ means

$$f_Y(y) = f_{Y_1}(y_1) f_{Y_2}(y_2) \cdots f_{Y_n}(y_n),$$

where $f_{Y_k}(y_k)$ is a one-dimensional marginal pdf. Statistical independence is a key
property of a good coordinate system for compression and particularly modeling because:
1) damage of one coordinate does not propagate to the others; and 2) it allows us to model
the $n$-dimensional stochastic process of interest as a set of 1D processes. Of course, in
general, it is difficult to find a truly statistically independent coordinate system for a given
stochastic process. Such a coordinate system may not even exist for a certain stochastic
process. Therefore, we should be satisfied with finding the least statistically dependent
coordinate system within a basis dictionary. Naturally, then, we need to measure the
"closeness" of a coordinate system $Y_1, \ldots, Y_n$ to statistical independence. This can
be measured by mutual information or relative entropy between the true pdf $f_Y$ and the
product of its marginal pdf's:

$$I(Y) = \int f_Y(y) \log \frac{f_Y(y)}{\prod_{i=1}^{n} f_{Y_i}(y_i)} \, dy = -H(Y) + \sum_{i=1}^{n} H(Y_i),$$

where $H(Y)$ and $H(Y_i)$ are the differential entropy of $Y$ and $Y_i$ respectively:

$$H(Y) = -\int f_Y(y) \log f_Y(y) \, dy, \quad H(Y_i) = -\int f_{Y_i}(y_i) \log f_{Y_i}(y_i) \, dy_i.$$

We note that $I(Y) \ge 0$, and $I(Y) = 0$ if and only if the components of $Y$ are mutually
independent. See Cover and Thomas (1991) for more details of the mutual information.

Suppose $Y = B^{-1} X$ and $B \in GL(n, \mathbb{R})$ with $\det(B) = \pm 1$. We denote such a group
of matrices by $SL^{\pm}(n, \mathbb{R})$. Note that the usual $SL(n, \mathbb{R})$ is a subgroup of $SL^{\pm}(n, \mathbb{R})$.
Then, we have

$$I(Y) = -H(Y) + \sum_{i=1}^{n} H(Y_i) = -H(X) + \sum_{i=1}^{n} H(Y_i),$$

since the differential entropy is invariant under such an invertible volume-preserving linear
transformation, i.e.,

$$H(B^{-1} X) = H(X) + \log |\det(B^{-1})| = H(X),$$

because $|\det(B^{-1})| = 1$. Based on this fact, we proposed the minimization of the following
cost function as the criterion to select the so-called least statistically-dependent basis
(LSDB) in Saito (2001):

$$C_H(B \mid X) = \sum_{i=1}^{n} H\bigl((B^{-1} X)_i\bigr) = \sum_{i=1}^{n} H(Y_i). \tag{3.3}$$

The sample estimate of this cost given the training dataset $\mathcal{T}$ is

$$C_H(B \mid \mathcal{T}) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{n} \log \hat{f}_{Y_i}(y_{i,k}),$$

where $\hat{f}_{Y_i}(y_{i,k})$ is an empirical pdf of the coordinate $Y_i$, which must be estimated by an
algorithm such as the histogram-based estimator with optimal bin-width search of Hall
and Morton (1993). Now, we can define the LSDB as

$$B_{LSDB} = B_{LSDB}(\mathcal{T}, \mathcal{D}) = \arg\min_{B \in \mathcal{D}} C_H(B \mid \mathcal{T}). \tag{3.4}$$

We note that the differences between this strategy and the standard independent component
analysis (ICA) algorithms are: 1) restriction of the search to the basis dictionary
$\mathcal{D}$; and 2) approximation of the coordinate-wise entropy. For more details, we refer the
reader to Saito (2001) for the former and Cardoso (1999) for the latter.
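
For data that take only finitely many values, the sum of coordinate-wise entropies can be sketched with a simple plug-in (relative frequency) estimate. Note that this is a swapped-in simplification for illustration and not the histogram-based estimator of Hall and Morton (1993) mentioned above; the function name `coordinate_entropy_cost` and the rounding parameter `decimals` are assumptions of this sketch.

```python
import numpy as np

def coordinate_entropy_cost(B, X, decimals=9):
    """Plug-in estimate (in bits) of sum_i H(Y_i) for discrete-valued data.

    B : (n, n) invertible matrix whose columns are the basis vectors.
    X : (n, N) array whose columns are the training samples.
    """
    # Rounding merges values that differ only by floating-point error.
    Y = np.round(np.linalg.solve(B, X), decimals)
    total = 0.0
    for row in Y:  # one coordinate Y_i at a time
        _, counts = np.unique(row, return_counts=True)
        freq = counts / counts.sum()
        total -= np.sum(freq * np.log2(freq))
    return total

n = 8
X = np.eye(n)  # each unit spike once, i.e., a uniform empirical pmf
cost_std = coordinate_entropy_cost(np.eye(n), X)
```

With this toy input every coordinate in the standard basis is a two-valued variable with probabilities $1/n$ and $1 - 1/n$, so the estimate should match $n$ times the corresponding binary entropy.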
Now we describe our analysis of some simple stochastic processes. 

4. Two-Dimensional Counterexample 

This example clearly demonstrates the difference between the sparsity and the statistical
independence criteria. Let us consider a simple process $X = (X_1, X_2)^T$ where $X_1$
and $X_2$ are independently and identically distributed as the uniform random variable on
the interval $[-1, 1]$. Thus, the realizations of this process are distributed as shown on the
right-hand side of Figure 1. Let us consider all possible rotations around the origin as a basis
dictionary, i.e., $\mathcal{D} = SO(2, \mathbb{R}) \subset O(2)$. Then, the sparsity and independence criteria
select completely different bases as shown in Figure 1. Note that the data points under
the BSB coordinates (45 degree rotation) concentrate more around the origin than under the
LSDB coordinates (with no rotation), and this makes the data representation sparser.
This example shows that the BSB and the LSDB are different in general.
One can also generalize this example to higher dimensions.
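
The sparsity side of this example can be checked with a small Monte Carlo sketch. The sample size, the random seed, and the choice $p = 1$ are assumptions made for illustration; the helper names `rotation` and `l1_cost` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Columns are realizations of X = (X_1, X_2)^T, i.i.d. uniform on [-1, 1].
X = rng.uniform(-1.0, 1.0, size=(2, 200_000))

def rotation(theta):
    """Rotation matrix whose columns form an orthonormal basis of R^2."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def l1_cost(theta):
    """Sample l^1 sparsity cost of the rotated coordinates."""
    Y = rotation(theta).T @ X  # B^{-1} = B^T for a rotation
    return np.mean(np.sum(np.abs(Y), axis=0))

cost_0 = l1_cost(0.0)          # no rotation; expected value is 1
cost_45 = l1_cost(np.pi / 4)   # 45 degree rotation; expected value is 2*sqrt(2)/3
```

For $p = 1$ the expected costs can be computed by hand: $E|X_1| + E|X_2| = 1$ without rotation, and $2 \cdot E|X_1 + X_2| / \sqrt{2} = 2\sqrt{2}/3 \approx 0.943$ after the 45 degree rotation, so the rotated coordinates are sparser even though the unrotated ones are the statistically independent ones.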

5. The Spike Process 

An $n$-dimensional spike process simply generates the standard basis vectors $\{e_j\}_{j=1}^{n} \subset
\mathbb{R}^n$ in a random order, where $e_j$ has one at the $j$th entry and all the other entries are
zero. One can view this process as a unit impulse located at a random position between
1 and $n$ as shown in Figure 2.

Fig. 1. Sparsity and statistical independence prefer different coordinates.

Fig. 2. Ten realizations of the spike process (n = 256).

5.1 The Karhunen-Loeve Basis

Let us first consider the Karhunen-Loeve basis of this process from which we can 
learn a few things. 

Proposition 5.1. The Karhunen-Loeve basis for the spike process is any orthonormal
basis in $\mathbb{R}^n$ containing the "DC" vector $\mathbf{1}_n = (1, 1, \ldots, 1)^T$.

This means that the KLB is not useful for this process. This is because the spike process 
is highly non-Gaussian. 
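
As a numerical sanity check of this proposition, the covariance matrix of the spike process can be formed directly from its $n$ equally likely realizations. This is a sketch only; the choice $n = 8$ is arbitrary.

```python
import numpy as np

n = 8
# Rows are the coordinates X_1..X_n, columns are the n equally likely spikes.
# bias=True divides by n, which gives the exact population covariance here.
R = np.cov(np.eye(n), bias=True)

eigvals = np.linalg.eigvalsh(R)  # eigenvalues in ascending order
ones = np.ones(n)                # the "DC" direction
```

The spectrum consists of a single zero eigenvalue, whose eigenvector is the DC vector, and an eigenvalue $1/n$ of multiplicity $n - 1$. Because the second eigenspace is degenerate, any orthonormal basis of it serves equally well, which is why the covariance alone cannot single out a preferred basis.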

5.2 The Best Sparsifying Basis 

It is obvious that the standard basis is the BSB among O(n) by construction; an 
expansion of a realization of this process into any other basis simply increases the number 
of nonzero coefficients. More precisely, we have the following proposition. 

Proposition 5.2. The BSB for the spike process is the standard basis if $\mathcal{D} = O(n)$
or $SL^{\pm}(n, \mathbb{R})$. If $\mathcal{D} = GL(n, \mathbb{R})$, then it must be a scalar multiple of the identity matrix,
i.e., $a I_n$ where $a$ is a nonzero constant.

Remark 2. Note that when we say the basis is a matrix such as $a I_n$, we really mean
that the column vectors of that matrix form the basis. This also means that any permuted
and/or sign-flipped (i.e., multiplied by $-1$) versions of those column vectors also form the
basis. Therefore, when we say the basis is a matrix $A$, we mean not only $A$ but also its
permuted and sign-flipped versions. This remark also applies to all the propositions,
lemmas, and theorems below, unless stated otherwise.

5.3 Statistical Dependence and Entropy of the Spike Process 

Before considering the LSDB of this process, let us note a few specifics about the spike 
process. First, although the standard basis is the BSB for this process, it clearly does not 
provide the statistically independent coordinates. The existence of a single spike at one 
location prohibits spike generation at other locations. This implies that these coordinates 
are highly statistically dependent. 

Second, we can compute the true entropy $H(X)$ for the spike process unlike other
complicated stochastic processes. Since the spike process selects one possible vector from
the standard basis of $\mathbb{R}^n$ with uniform probability $1/n$, the true entropy $H(X)$ is clearly
$\log n$. This is one of the rare cases where we know the true high-dimensional entropy of
the process.

5.4 The LSDB among the Haar-Walsh Dictionary

Our first theorem specifies the LSDB selected from the well-known Haar-Walsh
dictionary, a subset of $O(n)$. This dictionary contains a large number of orthonormal bases
(in fact, more than $2^{n/2}$ bases) including the standard basis, the Haar basis (consisting of
dyadically-scaled and shifted versions of boxcar functions), and the Walsh basis (consisting
of square waves). Because the basis vectors in this dictionary are all piecewise
constant (except the standard basis vectors), they are often used to analyze and compress
discontinuous or blocky signals such as acoustic impedance profiles of subsurface
structure. See Wickerhauser (1994), Saito (2000), and Section 6.3 of this paper for the
details of this dictionary.

Theorem 5.1. Suppose we restrict our search of the bases within the Haar-Walsh 
dictionary. Then, the LSDB is: 

• the standard basis if n > 4; and 

• the Walsh basis if n = 2 or 4. 

Moreover, the true independence can be achieved only for n = 2. Note that n is always a 
dyadic number in this dictionary. 



5.5 The LSDB among O(n) 

It is curious what happens if we do not restrict ourselves to the Haar-Walsh dictionary. 
Then, we have the following theorem. 

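Theorem 5.2. The LSDB among $O(n)$ is the following:

• for $n \ge 5$, either the standard basis or the basis whose matrix representation is

$$B_{O(n)} = \frac{1}{n} \begin{pmatrix}
n-2 & -2 & \cdots & -2 & -2 \\
-2 & n-2 & \ddots & & -2 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
-2 & & \ddots & n-2 & -2 \\
-2 & -2 & \cdots & -2 & n-2
\end{pmatrix}; \tag{5.1}$$

• for $n = 4$, the Walsh basis, i.e.,

$$B_{O(4)} = \frac{1}{2} \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1
\end{pmatrix};$$

• for $n = 3$,

$$B_{O(3)} = \begin{pmatrix}
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{6}} & \frac{-1}{\sqrt{2}} \\
\frac{1}{\sqrt{3}} & \frac{-2}{\sqrt{6}} & 0
\end{pmatrix}; \text{ and}$$

• for $n = 2$,

$$B_{O(2)} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},$$

and this is the only case where the true independence is achieved.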



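Remark 3. There is an important geometric interpretation of (5.1). This matrix can
also be written as:

$$B_{O(n)} = I_n - 2 \, \frac{\mathbf{1}_n}{\sqrt{n}} \frac{\mathbf{1}_n^T}{\sqrt{n}}.$$

In other words, this matrix represents the Householder reflection with respect to the
hyperplane $\{y \in \mathbb{R}^n \mid \sum_{i=1}^{n} y_i = 0\}$ whose unit normal vector is $\mathbf{1}_n / \sqrt{n}$.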
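
The Householder interpretation of (5.1) is easy to confirm numerically. The following is a minimal sketch with an arbitrarily chosen dimension $n = 6$; the variable names are illustrative only.

```python
import numpy as np

n = 6
v = np.ones(n) / np.sqrt(n)  # unit normal vector 1_n / sqrt(n)

# Householder reflection across the hyperplane orthogonal to v.
B_householder = np.eye(n) - 2.0 * np.outer(v, v)

# Entry-wise form of (5.1): (n - 2)/n on the diagonal and -2/n elsewhere.
B_51 = (n * np.eye(n) - 2.0 * np.ones((n, n))) / n

# Coefficients of the first spike e_1 in this basis (B is orthogonal, so B^{-1} = B^T).
coeffs = B_51.T @ np.eye(n)[:, 0]
```

Both constructions give the same orthogonal matrix, and the coefficient vector of a single spike has no zero entry, which is the sense in which this basis makes the spike process completely dense.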



5.6 The LSDB among GL(n, R) 

Before discussing the LSDB among a larger class of bases, let us remark on an important
property specific to discrete stochastic processes.

Let $X$ be a random vector obeying a discrete stochastic process with a probability
mass function (pmf) $f_X$. This means that there are only a finite number of possible values
(or states) $X$ can take. Clearly the spike process is a discrete process since the only
possible values are $\{e_1, \ldots, e_n\}$, the standard basis vectors. Then, for any invertible
transformation $B \in GL(n, \mathbb{R})$ with $Y = B^{-1} X$, be it orthonormal or not, the total
entropy of the process before and after the transformation is exactly the same. Indeed,
in the definition of discrete Shannon entropy, $-\sum_j p_j \log p_j$, the values that the random
variable takes are of no importance; only the number of possible values the random
variable can take and its pmf matter. In our case, it is clear that the events $\{X = a_i\}$
and $\{Y = b_i\}$ where $b_i = B^{-1} a_i$ are equivalent; otherwise the transformation would not
be invertible. This shows that the corresponding probabilities are equal:

$$\Pr\{X = a_i\} = \Pr\{Y = b_i\}.$$

Therefore, considering the expression of discrete entropy, this proves that

$$H(Y) = H(X),$$

as long as the transformation matrix belongs to $GL(n, \mathbb{R})$. Note that for the continuous
case, this is only true if $B \in SL^{\pm}(n, \mathbb{R})$. Therefore, for a discrete stochastic process like
the spike process, the LSDB among $GL(n, \mathbb{R})$ can be selected by just minimizing the
sum of the coordinate-wise entropy as in (3.4), as if $\mathcal{D} = SL^{\pm}(n, \mathbb{R})$. In other words, there
is no important distinction in the LSDB selection from $GL(n, \mathbb{R})$ and from $SL^{\pm}(n, \mathbb{R})$
for discrete stochastic processes. Therefore, we do not have to treat these two cases
separately.

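Theorem 5.3. The LSDB among $GL(n, \mathbb{R})$ with $n > 2$ is the following basis pair
(for analysis and synthesis respectively):

$$B^{-1}_{GL(n,\mathbb{R})} = \begin{pmatrix}
a & a & \cdots & \cdots & a \\
b_2 & c_2 & b_2 & \cdots & b_2 \\
b_3 & b_3 & c_3 & \ddots & b_3 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
b_{n-1} & \cdots & b_{n-1} & c_{n-1} & b_{n-1} \\
b_n & \cdots & \cdots & b_n & c_n
\end{pmatrix}, \tag{5.2}$$

where $a$, $b_k$, $c_k$ are arbitrary real-valued constants satisfying $a \neq 0$, $b_k \neq c_k$, $k = 2, \ldots, n$, and

$$B_{GL(n,\mathbb{R})} = \begin{pmatrix}
\bigl(1 + \sum_{k=2}^{n} b_k d_k\bigr)/a & -d_2 & -d_3 & \cdots & -d_n \\
-b_2 d_2 / a & d_2 & 0 & \cdots & 0 \\
-b_3 d_3 / a & 0 & d_3 & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
-b_n d_n / a & 0 & \cdots & 0 & d_n
\end{pmatrix}, \tag{5.3}$$

where $d_k = 1/(c_k - b_k)$, $k = 2, \ldots, n$.

If we restrict ourselves to $\mathcal{D} = SL^{\pm}(n, \mathbb{R})$, then the parameter $a$ must satisfy:

$$a = \pm \prod_{k=2}^{n} (c_k - b_k)^{-1}.$$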
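
The following sketch builds the analysis matrix (5.2) and the synthesis matrix (5.3) for given parameters and checks that they are inverses of each other. The function name `lsdb_pair` and the particular parameter values are assumptions made for illustration; the layout follows the description above: the first row of (5.2) is constant $a$, and row $k$ holds $b_k$ everywhere except for $c_k$ on the diagonal.

```python
import numpy as np

def lsdb_pair(a, b, c):
    """Return (B_inv, B) of the forms (5.2) and (5.3).

    a : nonzero scalar.
    b, c : sequences of length n-1 holding b_2..b_n and c_2..c_n (b_k != c_k).
    """
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    n = b.size + 1
    d = 1.0 / (c - b)            # d_k = 1 / (c_k - b_k)
    idx = np.arange(1, n)

    B_inv = np.empty((n, n))
    B_inv[0, :] = a              # first row: a, a, ..., a
    B_inv[1:, :] = b[:, None]    # row k filled with b_k
    B_inv[idx, idx] = c          # diagonal entries c_k

    B = np.zeros((n, n))
    B[0, 0] = (1.0 + np.sum(b * d)) / a
    B[0, 1:] = -d
    B[1:, 0] = -b * d / a
    B[idx, idx] = d
    return B_inv, B

B_inv, B = lsdb_pair(a=2.0, b=[0.5, -1.0, 3.0], c=[1.5, 2.0, -1.0])
identity_error = np.max(np.abs(B_inv @ B - np.eye(4)))
```

The determinant of the analysis matrix equals $a \prod_{k=2}^{n} (c_k - b_k)$, which can be seen by subtracting $b_k / a$ times the first row from row $k$; this is where the constraint on $a$ for $SL^{\pm}(n, \mathbb{R})$ comes from.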



Remark 4. The LSDB such as (5.1) and the LSDB pair (5.2), (5.3) provide us with
further insight into the difference between sparsity and statistical independence. In the
case of (5.1), this is the LSDB, yet it does not sparsify the spike process at all. In fact,
these coordinates are completely dense, i.e., $C_0 = n$. We can also show that the sparsity
measure $C_p$ gets worse as $n \to \infty$. More precisely, we have the following proposition.

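Proposition 5.3.

$$\lim_{n \to \infty} C_p(B_{O(n)} \mid X) = \begin{cases} \infty & \text{if } 0 < p < 1; \\ 3 & \text{if } p = 1. \end{cases}$$

It is interesting to note that this LSDB approaches the standard basis as $n \to \infty$.
This also implies that

$$\lim_{n \to \infty} C_p(B_{O(n)} \mid X) \neq C_p\Bigl(\lim_{n \to \infty} B_{O(n)} \Bigm| X\Bigr).$$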
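
The limits in Proposition 5.3 can be illustrated numerically. Each column of the matrix (5.1) has one entry $(n-2)/n$ and $n-1$ entries $-2/n$, so every spike yields the same cost $|1 - 2/n|^p + (n-1)(2/n)^p$. The sketch below evaluates this closed form and cross-checks it against the matrix itself; the function names are hypothetical.

```python
import numpy as np

def cp_closed_form(n, p):
    """C_p of the spike process under the basis (5.1), from its entries."""
    return abs(1.0 - 2.0 / n) ** p + (n - 1) * (2.0 / n) ** p

def cp_from_matrix(n, p):
    """Same cost computed directly from the matrix, averaging over all n spikes."""
    B = np.eye(n) - 2.0 * np.ones((n, n)) / n
    Y = np.linalg.solve(B, np.eye(n))  # columns are B^{-1} e_j
    return np.mean(np.sum(np.abs(Y) ** p, axis=0))

for n in (10, 100, 1000, 10_000):
    print(n, cp_closed_form(n, 1.0), cp_closed_form(n, 0.5))
```

For $p = 1$ the closed form simplifies to $3 - 4/n$, which tends to 3, while for $0 < p < 1$ the term $(n-1)(2/n)^p$ grows like $n^{1-p}$ and the cost diverges.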



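As for the analysis LSDB (5.2), the ability to sparsify the spike process depends on
the values of $b_k$ and $c_k$. Since the parameters $a$, $b_k$ and $c_k$ are arbitrary as long as $a \neq 0$
and $b_k \neq c_k$, let us put $a = 1$, $b_k = 0$, $c_k = 1$, for $k = 2, \ldots, n$. Then we get the following
specific LSDB pair:

$$B^{-1}_{GL(n,\mathbb{R})} = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
0 & & & \\
\vdots & & I_{n-1} & \\
0 & & &
\end{pmatrix}, \quad
B_{GL(n,\mathbb{R})} = \begin{pmatrix}
1 & -1 & \cdots & -1 \\
0 & & & \\
\vdots & & I_{n-1} & \\
0 & & &
\end{pmatrix}.$$

This analysis LSDB provides us with a sparse representation for the spike process (though
this is clearly not better than the standard basis). For $Y = B^{-1}_{GL(n,\mathbb{R})} X$,

$$C_0 = E\bigl[\|Y\|_0\bigr] = \frac{1}{n} \times 1 + \frac{n-1}{n} \times 2 = 2 - \frac{1}{n}.$$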

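Now, let us take $a = 1$, $b_k = 1$, $c_k = 2$ for $k = 2, \ldots, n$ in (5.2) and (5.3). Then we get

$$B^{-1}_{GL(n,\mathbb{R})} = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
1 & 2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 1 \\
1 & \cdots & 1 & 2
\end{pmatrix}, \quad
B_{GL(n,\mathbb{R})} = \begin{pmatrix}
n & -1 & \cdots & -1 \\
-1 & & & \\
\vdots & & I_{n-1} & \\
-1 & & &
\end{pmatrix}. \tag{5.4}$$

The spike process under this analysis basis is completely dense, i.e., $C_0 = n$. Yet this is
still the LSDB.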



Finally, from Theorems 5.2 and 5.3, we can prove the following corollary: 

Corollary 5.1. There is no invertible linear transformation providing the statistically
independent coordinates for the spike process for $n > 2$. In fact, the mutual
information $I\bigl(B^{-1}_{O(n)} X\bigr)$ and $I\bigl(B^{-1}_{GL(n,\mathbb{R})} X\bigr)$ are monotonically increasing as a function
of $n$, and both approach $\log e \approx 1.4427$ as $n \to \infty$.
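
The behavior stated in this corollary can be examined with closed-form expressions that follow from the structure of the bases. Under the standard basis (or the basis (5.1)), every coordinate takes two values with probabilities $1/n$ and $1 - 1/n$; under the analysis basis (5.2), the first coordinate is the constant $a$ and each remaining coordinate equals $c_k$ with probability $1/n$ and $b_k$ otherwise. The sketch below is an illustration derived from these observations, not code from the paper, and the function names are hypothetical.

```python
import numpy as np

def h2(q):
    """Entropy in bits of a two-valued variable with probabilities q and 1 - q."""
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def mi_orthonormal(n):
    """I(Y) = sum_i H(Y_i) - H(X) for the standard basis or (5.1); H(X) = log2(n)."""
    return n * h2(1.0 / n) - np.log2(n)

def mi_general_linear(n):
    """I(Y) for the analysis basis (5.2): one constant coordinate, n - 1 two-valued ones."""
    return (n - 1) * h2(1.0 / n) - np.log2(n)

for n in (5, 10, 100, 10_000):
    print(n, mi_orthonormal(n), mi_general_linear(n))
```

Two observations are consistent with the results above: the second expression vanishes at $n = 2$, the only case where true independence is attainable, and both expressions increase with $n$ toward $\log_2 e$.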



Remark 5. Although the spike process is very simple, we have the following interpretation.
Consider a stochastic process that, at each time, generates a basis vector randomly
selected from some orthonormal basis. Then that basis itself is both the BSB and the
LSDB among $O(n)$. Theorem 5.2 claims that once we transform the data to the spikes,
one cannot do any better than that both in sparsity and independence within $O(n)$. Of
course, if one extends the search to nonlinear transformations, then it becomes a different
story. We refer the reader to our recent articles Lin et al. (2000, 2001) for the details of
a nonlinear algorithm.



6. Proofs of Propositions and Theorems 



6.1 Proof of Proposition 5.1

Proof. Let $X = (X_1, X_2, \ldots, X_n)^T$ be a random vector generated by this process.
For each of its realizations, a randomly chosen coordinate among these $n$ positions takes
the value 1, while the others take the value 0. Hence each $X_i$, $i = 1, \ldots, n$, takes the
value 1 with probability $1/n$ and the value 0 with probability $1 - 1/n$. Let us calculate
the covariance of these variables. First, we have:

$$E(X_i) = \frac{1}{n} \times 1 + \Bigl(1 - \frac{1}{n}\Bigr) \times 0 = \frac{1}{n} \quad \text{for } i = 1, \ldots, n,$$

$$E(X_i X_j) = \begin{cases} E(X_i^2) = \frac{1}{n} & \text{if } i = j; \\ 0 & \text{if } i \neq j, \end{cases}$$

since, for $i \neq j$, one of these two variables will always take the value 0. Let $R = (R_{ij})$ be the
covariance matrix of this process. Then, we have:

$$R_{ij} = E(X_i X_j) - E(X_i) E(X_j) = \frac{1}{n} \delta_{ij} - \frac{1}{n^2}.$$

We know that a basis is a Karhunen-Loeve basis if and only if it is orthonormal and diagonalizes
the covariance matrix. Thus, we will now calculate the eigenvalue decomposition
of the covariance matrix $R = \frac{1}{n} I_n - \frac{1}{n^2} J_n$, where $I_n$ is the identity matrix of size $n \times n$,
and $J_n$ is an $n \times n$ matrix with each entry taking the value 1.
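We now need to calculate the determinant:

$$P_R(\lambda) = \det(\lambda I_n - R) = \begin{vmatrix}
\lambda - \frac{1}{n} + \frac{1}{n^2} & \frac{1}{n^2} & \cdots & \frac{1}{n^2} \\
\frac{1}{n^2} & \lambda - \frac{1}{n} + \frac{1}{n^2} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \frac{1}{n^2} \\
\frac{1}{n^2} & \cdots & \frac{1}{n^2} & \lambda - \frac{1}{n} + \frac{1}{n^2}
\end{vmatrix},$$

which is of the generic form:

$$\Delta(a, b) = \begin{vmatrix}
a + b & b & \cdots & b \\
b & a + b & \ddots & \vdots \\
\vdots & \ddots & \ddots & b \\
b & \cdots & b & a + b
\end{vmatrix}$$

with the values $a = \lambda - 1/n$ and $b = 1/n^2$. This is calculated by subtracting the last row
from all the others, and then adding the first $n - 1$ columns to the last one:

$$\Delta(a, b) = \begin{vmatrix}
a & 0 & \cdots & 0 & -a \\
0 & a & \ddots & \vdots & \vdots \\
\vdots & \ddots & \ddots & 0 & \vdots \\
0 & \cdots & 0 & a & -a \\
b & \cdots & b & b & a + b
\end{vmatrix}
= \begin{vmatrix}
a & 0 & \cdots & 0 & 0 \\
0 & a & \ddots & \vdots & \vdots \\
\vdots & \ddots & \ddots & 0 & \vdots \\
0 & \cdots & 0 & a & 0 \\
b & \cdots & b & b & a + nb
\end{vmatrix}
= a^{n-1} (a + nb). \tag{6.1}$$

Putting $a = \lambda - 1/n$ and $b = 1/n^2$, we have the characteristic polynomial $P_R$ of $R$ as
$P_R(\lambda) = \lambda (\lambda - 1/n)^{n-1}$. Hence, the eigenvalues of $R$ are $\lambda = 0$ or $1/n$.

It is now obvious that the vector $\mathbf{1}_n = (1, \ldots, 1)^T$ is an eigenvector for $R$ associated
with the eigenvalue 0, i.e., $\mathbf{1}_n \in \ker R$. Indeed, we have

$$R \mathbf{1}_n = \frac{1}{n} I_n \mathbf{1}_n - \frac{1}{n^2} J_n \mathbf{1}_n = \frac{1}{n} \mathbf{1}_n - \frac{1}{n^2} \, n \mathbf{1}_n = 0.$$

Since $\dim \ker R = 1$, $\ker R$ is the one-dimensional subspace spanned by $\mathbf{1}_n$. Considering
that $R$ is symmetric and only has two distinct eigenvalues, we know that the eigenspace
associated to the eigenvalue $1/n$ is the orthogonal complement of $\ker R$, i.e., the hyperplane
$\{y \in \mathbb{R}^n \mid \sum_{i=1}^{n} y_i = 0\}$. Therefore, the orthogonal bases that diagonalize $R$ are the bases
formed by the adjunction of $\mathbf{1}_n$ to any orthogonal basis of $(\ker R)^{\perp}$. The Walsh basis,
which consists of oscillating square waves, is such a basis, although it is just one among
many. □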

6.2 Proof of Proposition 5.2 

Proof. The case $\mathcal{D} = O(n)$ is obvious as discussed before this proposition. Therefore,
we first prove the case $\mathcal{D} = GL(n, \mathbb{R})$. To maximize the sparsity, it is clear that
the transformation matrix must be diagonal (modulo permutations and sign flips), i.e.,
$B_p = \mathrm{diag}(a_1, \ldots, a_n)$ with $a_k \neq 0$, $k = 1, \ldots, n$. The sparsity cost $C_p$ defined in (3.1) can
be computed and bounded in this case as follows:

$$C_p(B_p \mid X) = E\|Y\|_p^p = \frac{1}{n} \sum_{k=1}^{n} |a_k|^p \ge |a|^p,$$

where $|a| = \min\{|a_1|, \ldots, |a_n|\}$. This lower bound is achieved when $B_p = a I_n$, i.e., a
nonzero constant times the standard basis. Now, if $\mathcal{D} = SL^{\pm}(n, \mathbb{R})$, then this constant $a$
must be either 1 or $-1$ since $\det(B_p) = \pm 1$ and $a \in \mathbb{R}$. □



6.3 A Brief Review of the Haar-Walsh Dictionary and the Best-Basis Algorithm 

Before proceeding to the proof of Theorem 5.1, let us first review the Haar-Walsh 
dictionary and define some necessary quantities. 
Let $n$ be a positive dyadic integer, i.e., $n = 2^{n_0}$ for some $n_0 \in \mathbb{N}$. An input vector
$x = (x_1, \ldots, x_n)^T$, viewed as a digital signal sampled on a regular grid in time, is first
decomposed into low and high frequency bands by the convolution-subsampling operations
on the discrete time domain with the pair consisting of a "lowpass" filter $\{h_\ell\}_{\ell=1}^{L}$
and a "highpass" filter $\{g_\ell\}_{\ell=1}^{L}$. Let $H$ and $G$ be the convolution-subsampling operators
using these filters, which are defined as:

$$(Hx)_k = \sum_{\ell=1}^{L} h_\ell \, x_{\ell+2(k-1)}, \quad (Gx)_k = \sum_{\ell=1}^{L} g_\ell \, x_{\ell+2(k-1)}, \quad k = 1, \ldots, n.$$

We assume the periodic boundary condition on $x$ (whose period is $n$). Hence, the filtered
sequences $Hx$ and $Gx$ are also periodic with period $n/2$. Their adjoint operations (i.e.,
upsampling-anticonvolution) $H^*$ and $G^*$ are defined as

$$(H^*x)_k = \sum_{1 \le k-2(\ell-1) \le L} h_{k-2(\ell-1)} \, x_\ell, \quad (G^*x)_k = \sum_{1 \le k-2(\ell-1) \le L} g_{k-2(\ell-1)} \, x_\ell, \quad k = 1, \ldots, 2n.$$
The filters $H$ and $G$ are called conjugate mirror filters (CMF's) if they satisfy the following
orthogonality (or perfect reconstruction) conditions:

$$HG^* = GH^* = 0 \quad \text{and} \quad H^*H + G^*G = I,$$

where $I$ is the identity operator. Various design criteria (concerning regularity, symmetry
etc.) on the lowpass filter coefficients $\{h_\ell\}$ can be found in Daubechies (1992). The Haar-Walsh
dictionary uses the filter pair with the shortest length ($L = 2$) and $h_1 = h_2 =
1/\sqrt{2}$. Once $\{h_\ell\}$ is fixed, the filter $G$ is obtained by setting $g_\ell = (-1)^{\ell-1} h_{L+1-\ell}$. This
decomposition process is iterated on both the low and high frequency components. The
first level decomposition generates two subsequences $Hx$ and $Gx$, each of which has length
$n/2$. In the case of the Haar-Walsh dictionary, these subsequences are:

$$Hx = \Bigl( \frac{x_1 + x_2}{\sqrt{2}}, \ldots, \frac{x_{n-1} + x_n}{\sqrt{2}} \Bigr)^T, \quad Gx = \Bigl( \frac{x_1 - x_2}{\sqrt{2}}, \ldots, \frac{x_{n-1} - x_n}{\sqrt{2}} \Bigr)^T.$$
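
One level of the Haar-Walsh decomposition and its inverse can be sketched as follows. This is a minimal illustration for the length-2 filter pair only; the function names `haar_analysis` and `haar_synthesis` are hypothetical, and the input length is assumed to be even.

```python
import numpy as np

def haar_analysis(x):
    """Return (Hx, Gx): pairwise sums and differences scaled by 1/sqrt(2)."""
    x = np.asarray(x, dtype=float)
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # Hx
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # Gx
    return low, high

def haar_synthesis(low, high):
    """Reconstruct x = H* (Hx) + G* (Gx) from the two half-length sequences."""
    x = np.empty(2 * low.size)
    x[0::2] = (low + high) / np.sqrt(2.0)  # samples x_1, x_3, ...
    x[1::2] = (low - high) / np.sqrt(2.0)  # samples x_2, x_4, ...
    return x

# A single spike of length 8 at position 3 (index 2).
x = np.zeros(8)
x[2] = 1.0
low, high = haar_analysis(x)
x_rec = haar_synthesis(low, high)
```

For a spike input, each level of the decomposition turns one nonzero coefficient into two of magnitude $1/\sqrt{2}$, which gives some intuition for why bases deeper in the tree represent spikes less sparsely.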

The second level generates four subsequences, H 2 x, GHx, HGx, G 2 x, each of which is 
of length n/A. If we repeat this process for k times (k = 0,1, ... ,K < no), then at the 
kth level, 2 k subsequences H k x, GH k ~ 1 x, . . ., G k ~ 1 Hx, G k x, each of which is of length 
2 n °~ k , are generated. As a whole there are (k + l)n expansion coefficients (including the 
original components of x). One can iterate this procedure and stop at the Kth level, 
where K < n . These coefficients are naturally organized in the binary tree structure 
as shown in Figure 3. For future reference, we refer to the tree with K = no as the 
maximal- depth tree or the full tree. Because of the perfect reconstruction condition on 
H and G, each decomposition step is also interpreted as a decomposition of the vector 
space into mutually orthogonal subspaces. Let O ,o denote the n-dimensional Euclidean 
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Fig. 3. A table of dictionary coefficients are organized as the binary tree structured table. 



space $\mathbb{R}^n$ spanned by the standard basis vectors. Hence, an input vector of length $n$ is an element of $\Omega_{0,0}$. Let $\Omega_{1,0}$ and $\Omega_{1,1}$ be mutually orthogonal subspaces generated by the application of the operators $H$ and $G$ respectively to the parent space $\Omega_{0,0}$. Then, in general, the $k$th step of the decomposition process ($k = 0, \ldots, K$) can be written as

$$\Omega_{k,\ell} = \Omega_{k+1,2\ell} \oplus \Omega_{k+1,2\ell+1}, \qquad \ell = 0, \ldots, 2^k - 1.$$

It is clear that $\dim \Omega_{k,\cdot} = 2^{n_0-k}$. With each subspace $\Omega_{k,\ell}$, we associate the basis vectors $w_{k,\ell,m} \in \mathbb{R}^n$, $m = 0, \ldots, 2^{n_0-k} - 1$, which span this subspace. The vector $w_{k,\ell,m}$ is roughly centered at $2^k m$, has length of support $2^k$, and oscillates $\approx \ell$ times. Note that for $k = 0$, we have the standard basis of $\mathbb{R}^n$. The expansion coefficients computed
by the convolution-subsampling operations can be viewed as the inner products between 
the input vector and these basis vectors although we never need to compute these inner 
products explicitly. Clearly, we have a redundant set of subspaces in the binary tree. In fact, it is easily proved that there are more than $2^{2^{K-1}}$ possible orthonormal bases in this binary tree; see e.g. Wickerhauser (1994) for the details. Because of this abundance of bases, such a binary tree of subspaces (or basis vectors) is called a wavelet packet dictionary for general CMF's and the Haar-Walsh dictionary if $L = 2$ and $h_1 = h_2 = 1/\sqrt{2}$. Now an important question is how to select the best coordinate system efficiently
for the problem at hand from this dictionary. 

The "best-basis" algorithm of Coifman and Wickerhauser (1992) first expands an input vector into a specified basis dictionary. Then a complete basis called a best basis (BB), which minimizes a certain cost function (such as the sparsity cost $C_p$ (3.1) or the statistical dependence cost $C_H$ (3.3); see also Saito (2000) for a variety of cost functions for different problems such as classification and regression), is searched for in this binary tree using a divide-and-conquer algorithm. More precisely, let $B_{k,\ell}$ denote a set of basis vectors belonging to the subspace $\Omega_{k,\ell}$ arranged as a matrix

(6.2) $B_{k,\ell} = (w_{k,\ell,0}, \ldots, w_{k,\ell,2^{n_0-k}-1}).$

Now let $A_{k,\ell}$ be the best basis for the input signal $x$ restricted to the span of $B_{k,\ell}$ and let $C$ be a cost function measuring the deficiency of the nodes (subspaces) such as $C_p$ or $C_H$. The following best-basis algorithm "prunes" this binary tree by comparing the cost of each parent node and its two children nodes:

Given an input vector $x \in \mathbb{R}^n$,

Step 0: Choose a basis dictionary $\mathcal{D}$, specify the maximum depth of decomposition $K$, and an information cost $C$.

Step 1: Expand $x$ into the dictionary $\mathcal{D}$ and obtain coefficients $\{B_{k,\ell}^T x\}_{0 \le k \le K,\; 0 \le \ell \le 2^k - 1}$.

Step 2: Set $A_{K,\ell} = B_{K,\ell}$ for $\ell = 0, \ldots, 2^K - 1$.

Step 3: Determine the best subspace $A_{k,\ell}$ in the bottom-up manner, i.e., for $k = K-1, \ldots, 0$, $\ell = 0, \ldots, 2^k - 1$, by

(6.3) $A_{k,\ell} = \begin{cases} B_{k,\ell} & \text{if } C(B_{k,\ell}^T x) \le C(A_{k+1,2\ell}^T x \cup A_{k+1,2\ell+1}^T x), \\ A_{k+1,2\ell} \oplus A_{k+1,2\ell+1} & \text{otherwise.} \end{cases}$



This algorithm becomes fast if the cost function $C$ is additive, i.e., $C(0) = 0$ and $C(x) = \sum_i C(x_i)$. Both $C_p$ of (3.1) and $C_H$ of (3.3) are clearly additive. If $C$ is additive, then in (6.3) we have

$$C(A_{k+1,2\ell}^T x \cup A_{k+1,2\ell+1}^T x) = C(A_{k+1,2\ell}^T x) + C(A_{k+1,2\ell+1}^T x).$$

This implies that a simple addition suffices instead of computing the cost of union of the 
nodes. 
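The pruning rule (6.3) with an additive cost can be illustrated by the following minimal sketch. It is not the authors' implementation: the function names `haar_split`, `lp_cost`, and `best_basis` are hypothetical, the dictionary is fixed to the Haar-Walsh case, and the cost is evaluated on a single input vector rather than as an expectation over a stochastic process. The recursion visits the children before comparing them with their parent, which is equivalent to the bottom-up search of Step 3.

```python
import numpy as np


def haar_split(v):
    """One Haar-Walsh step: lowpass Hv and highpass Gv, each of half length."""
    even, odd = v[0::2], v[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)


def lp_cost(coeffs, p=1.0):
    """Additive cost sum_i |c_i|^p; zero coefficients contribute nothing."""
    a = np.abs(coeffs)
    return float(np.sum(a[a > 0] ** p))


def best_basis(v, depth, cost=lp_cost, k=0, l=0):
    """Return (cost, nodes), where nodes is a list of (k, l, coefficients)."""
    parent_cost = cost(v)
    if depth == 0 or len(v) < 2:
        return parent_cost, [(k, l, v)]
    low, high = haar_split(v)
    cost_low, nodes_low = best_basis(low, depth - 1, cost, k + 1, 2 * l)
    cost_high, nodes_high = best_basis(high, depth - 1, cost, k + 1, 2 * l + 1)
    # Additivity: the cost of the union of the children is a simple sum.
    if parent_cost <= cost_low + cost_high:
        return parent_cost, [(k, l, v)]                   # keep the parent node
    return cost_low + cost_high, nodes_low + nodes_high   # keep the children


# A single spike is already sparsest in the standard basis (the root node).
spike = np.zeros(8)
spike[2] = 1.0
spike_cost, spike_nodes = best_basis(spike, depth=3)

# A constant vector is compressed into one coefficient of the deepest level.
flat = np.ones(8)
flat_cost, flat_nodes = best_basis(flat, depth=3)
```

Each returned node is labeled by its level `k` and position `l`, matching the indexing of the subspaces $\Omega_{k,\ell}$; the selected nodes always hold exactly `len(v)` coefficients in total.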

Coming back to the Haar-Walsh case, we need a few more definitions for the proof of Theorem 5.1. At each level of the decomposition, the leftmost node (or box) representing the coefficients $H^k x$ is marked by $+$ in Figure 3. This node also corresponds to the subspace $\Omega_{k,0}$. Clearly, each coefficient in this node must be of the form

(6.4) $\frac{1}{\sqrt{2^k}} \left( x_{\sigma(1)} + \cdots + x_{\sigma(2^k)} \right)$

where $\sigma$ is a permutation of $\{1, \ldots, n\}$. We call these nodes and the corresponding coefficients the positive node and the positive coefficients, respectively. All the other nodes marked by the $-$ sign at the $k$th level corresponding to the subspaces $\Omega_{k,\ell}$, $\ell \ne 0$, contain coefficients of the form:

(6.5) $\frac{1}{\sqrt{2^k}} \left( x_{\sigma(1)} + \cdots + x_{\sigma(2^{k-1})} - x_{\sigma(2^{k-1}+1)} - \cdots - x_{\sigma(2^k)} \right).$

These nodes and coefficients are referred to as negative nodes and negative coefficients, 
respectively. We note that any descendant node of a negative node must be negative. In 
fact, only the left child node of a positive node can be positive. 



Fig. 4. A plot of $f : x \mapsto -[x \log x + (1 - x) \log(1 - x)]$.

6.4 Proof of Theorem 5.1 

Proof. Let us consider the positive coefficients. The $k$th-level positive node contains $n/2^k$ coefficients, each of which is generated by (6.4), which in the case of the spike process can take only the following values:

• $+1/\sqrt{2^k}$ with probability $2^k/n$;

• $0$ with probability $1 - 2^k/n$.

Thus the entropy of each coordinate in the $k$th-level positive node can be computed as

$$h_+(k) = -\frac{2^k}{n} \log \frac{2^k}{n} - \left(1 - \frac{2^k}{n}\right) \log \left(1 - \frac{2^k}{n}\right) = f\left(\frac{2^k}{n}\right),$$

where

(6.6) $f(x) \triangleq -[x \log x + (1 - x) \log(1 - x)],$

which is displayed in Figure 4. The following properties of this function $f$ are basic and will be used repeatedly in this paper:

• For all $x \in [0, 1]$, $f(x) \ge 0$, and $f(x) = 0$ if and only if $x = 0$ or $x = 1$;

• For all $x \in [0, 1]$, $f(x) = f(1 - x)$;

• $f$ is increasing on $[0, 1/2]$, and decreasing on $[1/2, 1]$;

• $f$ is concave on $[0, 1]$.

On the other hand, the remaining $n - (n/2^k)$ negative coefficients at level $k$ are computed by (6.5), which can take three different values:

• $+1/\sqrt{2^k}$ with probability $2^{k-1}/n$;

• $-1/\sqrt{2^k}$ with probability $2^{k-1}/n$;

• $0$ with probability $1 - 2^k/n$.

Thus the entropy of each negative coordinate of level $k$ is

$$h_-(k) = -\frac{2^k}{n} \log \frac{2^{k-1}}{n} - \left(1 - \frac{2^k}{n}\right) \log \left(1 - \frac{2^k}{n}\right) = g\left(\frac{2^k}{n}\right),$$

where

(6.7) $g(x) = -[x \log(x/2) + (1 - x) \log(1 - x)] = f(x) + x,$

which is plotted in Figure 5.

Fig. 5. A plot of $g : x \mapsto -[x \log(x/2) + (1 - x) \log(1 - x)]$.

The following lemma is used to compare the entropy cost between a parent node and 
its children nodes of the Haar- Walsh dictionary. 

Lemma 6.1.

(6.8) $h_-(k) < h_-(k+1),$

(6.9) $h_+(k) < \frac{1}{2} [h_+(k+1) + h_-(k+1)],$

for $k = 1, \ldots, n_0 - 2$.

Proof. Using the function $g$ defined in (6.7), we have $h_-(k) - h_-(k+1) = g(2^k/n) - g(2^{k+1}/n)$. As shown in Figure 6, the function $g(x) - g(2x)$ is always negative as long as $x = 2^k/n < 0.43595\cdots$. Since $n = 2^{n_0}$, this condition reads $k - n_0 < \log(0.43595) \approx -1.1977$, i.e., $k \le n_0 - 2$. Hence we have proved (6.8). To prove (6.9), we have $h_+(k) - \frac{1}{2}[h_+(k+1) + h_-(k+1)] = f(2^k/n) - \frac{1}{2}[f(2^{k+1}/n) + g(2^{k+1}/n)]$. However,

$$f(x) - \tfrac{1}{2}[f(2x) + g(2x)] = f(x) - \tfrac{1}{2}[g(2x) - 2x + g(2x)] = (f(x) + x) - g(2x) = g(x) - g(2x) < 0,$$

if $x = 2^k/n < 0.43595$, i.e., $k \le n_0 - 2$ as before. □

Fig. 6. A plot of the function $g(x) - g(2x)$.
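The threshold $0.43595\cdots$ used in the proof of Lemma 6.1 can be reproduced numerically. The sketch below is only a sanity check under the assumption that all logarithms are base 2 (entropies in bits); the helper names `bisect_root` and `lemma_6_1_holds` are hypothetical and not part of the paper.

```python
import math


def f(x):
    """Binary entropy (6.6) in bits, with the convention 0 log 0 = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -(x * math.log2(x) + (1.0 - x) * math.log2(1.0 - x))


def g(x):
    """Function (6.7): g(x) = f(x) + x."""
    return f(x) + x


def bisect_root(func, lo, hi, tol=1e-12):
    """Plain bisection; assumes func changes sign on [lo, hi]."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lemma_6_1_holds(n0):
    """Check (6.8) and (6.9) for k = 1, ..., n0 - 2 with n = 2**n0."""
    n = 2 ** n0
    for k in range(1, n0 - 1):
        x = 2 ** k / n
        h_plus, h_minus = f(x), g(x)
        h_plus_next, h_minus_next = f(2 * x), g(2 * x)
        if not (h_minus < h_minus_next
                and h_plus < 0.5 * (h_plus_next + h_minus_next)):
            return False
    return True


# Zero of g(x) - g(2x) on (0.3, 0.5): negative to the left, positive to the right.
x_star = bisect_root(lambda x: g(x) - g(2.0 * x), 0.3, 0.5)
checks = [lemma_6_1_holds(n0) for n0 in range(3, 11)]
```

The same functions also show why the range of $k$ stops at $n_0 - 2$: at $x = 1/2$ (that is, $k = n_0 - 1$) one has $g(1/2) = 3/2 > g(1) = 1$, so (6.8) fails at the last level.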

Inequality (6.8) implies that the entropy of a negative coordinate at one level is smaller than that at the level below. Therefore, a negative parent node has smaller entropy than its two negative children nodes provided that the children nodes are not the bottom leaves, i.e., if the maximal decomposition level $K$ satisfies $K < n_0$. In fact, we have $h_-(n_0 - 2) < h_-(n_0 - 1)$, but $h_-(n_0 - 1) > h_-(n_0)$. This means that starting from the non-maximal depth negative nodes, the best-basis algorithm always chooses the furthest possible ancestor negative nodes.

As for the positive nodes, from (6.9), we can compare the total entropy of the positive node at level $k$ with that of the two children nodes (positive and negative) as follows:

$$\frac{n}{2^k} h_+(k) < \frac{n}{2^{k+1}} [h_+(k+1) + h_-(k+1)],$$

since the parent node contains $n/2^k$ coordinates and each of the children nodes has $n/2^{k+1}$ coordinates. Therefore, again the parent positive node has smaller entropy than its two children nodes as long as the tree is of non-maximal depth.

These two facts prove that the best-basis algorithm seeking the minimum entropy selects the root node, i.e., the LSDB is the standard basis, if $K \le n_0 - 1$.
Now, we need to consider the case of the maximal-depth tree. Notice that although (6.8) does not hold for $k = n_0 - 1$, the following holds:

$$h_-(n_0 - 3) < h_-(n_0),$$

since $g(1) > g(1/8)$ (see also Figure 5). This allows us, using the best-basis algorithm, to move up from a pair of bottom leaves not to their immediate parent but to their "great-grandfather", with decreasing entropy, as long as this great-grandfather is still a negative node and $n_0 > 3$. We still need to consider what happens if this assumption is false, that is, if we have maximal-depth leaves with a positive great-grandfather. The self-similar structure of this tree proves that this problem is equivalent to the general problem with $n_0 = 3$, which we shall now discuss.

$n_0 = 3$ (i.e., $n = 8$): Let us show that whatever the set of coordinates chosen among these, the entropy they generate is larger than that of the root node, which is also positive. The entropy of the root node is $8 \times f(1/8) \approx 4.34$ bits.

The choice of a basis in this dictionary is equivalent to the choice of a binary tree of depth $K \le 3$. This reduces to:

• the choice of the level of the positive node in the basis, which also amounts to the choice of the depth of the leftmost leaf of the tree;

• the choice of an orthonormal basis of the subspace orthogonal to the chosen positive node.

We note that all the negative coordinates of the tree have entropy at least as large as that of the bottom leaves: this follows from $g(1) < g(1/4) < g(1/2)$ (see Figure 5). Thus, the entropy of any basis with its positive node on level $k$ is at least

$$2^{n_0-k} \times f(2^{k-n_0}) + (n - 2^{n_0-k}) \times g(1).$$

Then there are three different cases corresponding to the level of the positive node:

• if the positive node is on the bottom level, then we only have one positive coordinate, and seven negative ones; therefore, the entropy of any such basis is at least $f(1) + 7 \times g(1) = 7$ bits;

• if the positive node is on level $k = 2$, we have two positive coordinates and six negative ones; thus the entropy of any such basis is at least $2 \times f(1/2) + 6 \times g(1) = 8$ bits;

• finally, if the positive node is on level $k = 1$, then the entropy of any such basis is at least $4 \times f(1/4) + 4 \times g(1) \approx 7.24$ bits.
Fig. 7. The Haar-Walsh dictionary table of depth $n_0 = 2$, i.e., $n = 4$.

All these values are larger than the entropy of the root node of this tree, namely the standard basis. Therefore, the standard basis is the LSDB among the Haar-Walsh dictionary for $n_0 = 3$.

$n_0 > 3$ (i.e., $n > 8$): What we saw shows that this is also true for any integer $n_0 > 3$ thanks to the self-similar structure of the binary tree dictionary. This ends the proof of the first part of the theorem: the standard basis is the LSDB among the Haar-Walsh dictionary for $n_0 \ge 3$, i.e., $n \ge 8$.

Therefore, we are left to consider the two special cases $n = 2$ and $n = 4$.

$n = 2$: In this case, the components of the spike process in the Walsh basis are truly independent. Indeed, the representation of $x = (x_1, x_2)^T$ in the Walsh basis is $\left(\frac{x_1 + x_2}{\sqrt{2}}, \frac{x_1 - x_2}{\sqrt{2}}\right)^T$. The sum of the coordinate-wise entropy of the spike process relative to the Walsh basis is $h_+(1) + h_-(1) = f(1) + g(1) = 0 + 1 = 1$ bit. That of the standard basis (i.e., the root node) is clearly $2f(1/2) = 2$ bits. Therefore, the Walsh basis always wins over the standard basis. Furthermore, the true entropy of this process is $\log n = \log 2 = 1$ bit, as explained in Subsection 5.3. Therefore, the mutual information of the spike process relative to the Walsh basis is $I(Y) = 1 - 1 = 0$ bit. We therefore have truly independent components for the spike process in this basis for $n = 2$, which is of course the LSDB.

$n = 4$: In this case, we consider all possible orthonormal bases in the dictionary exhaustively. Let us mark the table of Figure 7 with $+$ and $-$ signs. We observe:

• each coordinate in the $-+$, $+-$, and $--$ nodes generates the same entropy, $g(1) = 1$ bit;

• the coordinate in the $++$ node generates $f(1) = 0$ bit;

• each coordinate in the $-$ node generates $g(1/2) = 3/2$ bits;

• each coordinate in the $+$ node generates $f(1/2) = 1$ bit.

From these coordinate-wise entropy values, we can compute the entropy of each possible basis in this dictionary as follows:

• the Walsh basis (the level $k = 2$ basis) generates $1 \times 0 + 3 \times 1 = 3$ bits;

• any basis using the $+$ node generates entropy of at least $2 \times 1 + 2 \times 1 = 4$ bits, hence is not the LSDB;

• any basis using the $-$ node generates entropy of at least $2 \times \frac{3}{2} + 1 \times 1 = 4$ bits, hence is not the LSDB;

• the standard basis generates $4 \log 4 - 3 \log 3 \approx 3.24$ bits, hence is not the LSDB.

Consequently, the LSDB for $n = 4$ is the Walsh basis. This basis does not provide truly independent components since $I(Y) = 3 - \log n = 3 - \log 4 = 1 \ne 0$.

This concludes the proof of Theorem 5.1. □ 
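The statement of Theorem 5.1 can be checked by brute force for small $n$. The following sketch is an illustration only, not the authors' code: it feeds all $n$ equiprobable realizations of the spike process (the rows of the identity matrix) through the Haar-Walsh tree, measures the sum of coordinate-wise entropies of each node empirically, and prunes the tree as in (6.3). The names `entropy_cost`, `min_entropy_basis`, and `spike_lsdb` are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def entropy_cost(samples):
    """Sum of coordinate-wise entropies in bits; rows are equiprobable realizations."""
    m = samples.shape[0]
    total = 0.0
    for col in np.round(samples, 12).T:      # one column per coordinate
        counts = Counter(col.tolist())
        total -= sum(c / m * math.log2(c / m) for c in counts.values())
    return total


def min_entropy_basis(samples, depth, k=0, l=0):
    """Best-basis search with the entropy cost; returns (cost, [(k, l), ...])."""
    parent = entropy_cost(samples)
    if depth == 0:
        return parent, [(k, l)]
    even, odd = samples[:, 0::2], samples[:, 1::2]
    low = (even + odd) / math.sqrt(2.0)
    high = (even - odd) / math.sqrt(2.0)
    cost_low, nodes_low = min_entropy_basis(low, depth - 1, k + 1, 2 * l)
    cost_high, nodes_high = min_entropy_basis(high, depth - 1, k + 1, 2 * l + 1)
    if parent <= cost_low + cost_high:
        return parent, [(k, l)]
    return cost_low + cost_high, nodes_low + nodes_high


def spike_lsdb(n0):
    """Minimum-entropy Haar-Walsh basis of the spike process with n = 2**n0."""
    n = 2 ** n0
    return min_entropy_basis(np.eye(n), depth=n0)


results = {n0: spike_lsdb(n0) for n0 in (1, 2, 3, 4)}
```

In this sketch the node `(0, 0)` is the root (the standard basis), and a result consisting only of level-$n_0$ nodes is the Walsh basis.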



6.5 Coordinate-wise Entropy of the Spike Process 

Before proceeding to the proof of Theorems 5.2 and 5.3, let us consider the coordinate-wise entropy of the spike process and define some convenient quantities to characterize a basis in $O(n)$ or $GL(n, \mathbb{R})$.

Let us consider an invertible matrix $U = (u_{ij})_{i,j=1,\ldots,n} = B^{-1} \in GL(n, \mathbb{R})$, and the vector $Y = UX$. Let us consider the $i$th coordinate of $Y$, $Y_i = \sum_{j=1}^n u_{ij} X_j$. For each realization of the spike process $X$, $Y_i$ takes one of the values $\{u_{ij}, j = 1, \ldots, n\}$. More precisely, we have $\Pr\{X_j = 1\} = 1/n$ and $\Pr\{X_j = 0\} = 1 - 1/n$, for $j = 1, \ldots, n$. Thus, if all $\{u_{ij}, j = 1, \ldots, n\}$ were distinct, $Y_i$ would take these values with a uniform pmf. But there is no particular reason to think that $\{u_{ij}, j = 1, \ldots, n\}$ are mutually distinct. Therefore, we shall group these values in "classes" of equality. Let us introduce, for each $i \in \{1, \ldots, n\}$, an integer $k(i)$ equal to the number of distinct values in the $i$th row vector $\{u_{ij}, j = 1, \ldots, n\}$, and the vector $c(i) = (\alpha_1(i), \ldots, \alpha_{k(i)}(i)) \in \mathbb{N}^{k(i)}$, where each component counts the number of occurrences of each distinct value in the $i$th row vector. We will call $k(i)$ the class of the $i$th row and $c(i)$ the index of that row. Clearly, we have

$$1 \le k(i) \le n \quad \text{and} \quad \sum_{\ell=1}^{k(i)} \alpha_\ell(i) = n.$$

For example, with $n = 3$, if we had

$$\begin{cases} Y_1 = X_1 + X_2 + X_3 \\ Y_2 = 5X_1 + 2X_2 + 2X_3 \\ Y_3 = -X_1 + X_2 \end{cases},$$

then we would get

$$\begin{cases} k(1) = 1, & c(1) = (3) \\ k(2) = 2, & c(2) = (2, 1) \\ k(3) = 3, & c(3) = (1, 1, 1) \end{cases}$$

since $\{u_{1j}\} = \{1, 1, 1\}$ in which we find three 1's, $\{u_{2j}\} = \{5, 2, 2\}$ in which we find two 2's and one 5, and $\{u_{3j}\} = \{-1, 1, 0\}$ in which we find one $-1$, one 1, and one 0.

Let us now examine the coordinate-wise entropy in terms of the quantities we have just defined. Suppose the value $u$ appears $\alpha_\ell(i)$ times in $\{u_{ij}, j = 1, \ldots, n\}$. Then the probability of the event $\{Y_i = u\}$ is $\alpha_\ell(i)/n$. Therefore, we have

$$H(Y_i) = -\sum_{\ell=1}^{k(i)} \frac{\alpha_\ell(i)}{n} \log \frac{\alpha_\ell(i)}{n}.$$

We shall now describe the different values that this coordinate-wise entropy takes as the 
number of distinct values and their occurrences vary. Because the entropy is a measure 
of uncertainty, we can intuitively guess that a coordinate with a small class number 
generates small entropy. 
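The class, the index, and the coordinate-wise entropy of a row are easy to compute mechanically. The sketch below applies the definitions above to the $n = 3$ example; the function names `row_index` and `row_entropy` are hypothetical, logarithms are taken in base 2, and the index is returned in sorted order (the order of the components of $c(i)$ does not affect the entropy).

```python
import math
from collections import Counter


def row_index(row, ndigits=12):
    """Return (k, c): the class k and the index c (sorted occurrence counts)."""
    counts = Counter(round(float(u), ndigits) for u in row)
    c = tuple(sorted(counts.values()))
    return len(c), c


def row_entropy(row):
    """H(Y_i) in bits for the spike process: -sum_l (alpha_l/n) log2(alpha_l/n)."""
    n = len(row)
    _, c = row_index(row)
    return -sum(a / n * math.log2(a / n) for a in c)


# The n = 3 example: rows of U define Y_1, Y_2, Y_3.
U = [[1, 1, 1],
     [5, 2, 2],
     [-1, 1, 0]]

classes = [row_index(row) for row in U]
entropies = [row_entropy(row) for row in U]
```

Rounding the entries before counting is a design choice of this sketch so that values equal up to floating-point noise fall into the same class.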

$k(i) = 1$: This necessarily means that $c(i) = (n)$, i.e., all the $\{u_{ij}, j = 1, \ldots, n\}$ are identical. Since there is no uncertainty about this coordinate, we can expect its entropy to be 0. Indeed, $H(Y_i) = -\sum_{\ell=1}^{1} \frac{n}{n} \log \frac{n}{n} = 0$.

$k(i) = 2$: Let us consider the link between the uncertainty and the index $c(i)$. $k(i) = 2$ means that $Y_i$ can take only two distinct values. The least scattered distribution of these two values corresponds to the case $c(i) = (1, n-1)$. This is also the distribution closest to the certain case $k(i) = 1$ and $c(i) = (n)$. We now show that the case $c(i) = (1, n-1)$ generates the smallest entropy. Suppose that $Y_i$ can take two distinct values with index $(\alpha_1, \alpha_2)$, $\alpha_1 + \alpha_2 = n$. In other words, $Y_i$ takes these two values with probability $\alpha_1/n$ and $\alpha_2/n = 1 - \alpha_1/n$, respectively. Without loss of generality, we can assume $\alpha_1 \le \alpha_2$. Therefore, the entropy of the coordinate $Y_i$ is

$$H(Y_i) = -\frac{\alpha_1}{n} \log \frac{\alpha_1}{n} - \frac{\alpha_2}{n} \log \frac{\alpha_2}{n} = -\frac{\alpha_1}{n} \log \frac{\alpha_1}{n} - \left(1 - \frac{\alpha_1}{n}\right) \log \left(1 - \frac{\alpha_1}{n}\right) = f\left(\frac{\alpha_1}{n}\right),$$

where the function $f$ is defined in (6.6) and shown in Figure 4. Since $\alpha_1 \le \alpha_2$, it suffices to consider $\alpha_1$ with $1 \le \alpha_1 \le n/2$. So, we have $1/n \le \alpha_1/n \le 1/2$, and in this interval, $f(\alpha_1/n)$ is strictly increasing. In other words,

$$f\left(\frac{1}{n}\right) \le f\left(\frac{\alpha_1}{n}\right) \le f\left(\frac{1}{2}\right).$$

Therefore, the entropy is minimal when $\alpha_1 = 1$ and $\alpha_2 = n - 1$. For $\alpha_1 \ge 2$, we clearly have $H(Y_i) \ge f(2/n)$.

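$k(i) \ge 3$: To find a lower bound of $H(Y_i) = -\sum_{\ell=1}^{k(i)} \frac{\alpha_\ell(i)}{n} \log \frac{\alpha_\ell(i)}{n}$, we need the following lemma:

Lemma 6.2. Let $k \ge 3$ be an integer, and let $(\alpha_1, \ldots, \alpha_k)$ be a set of strictly positive integers with $\sum_{j=1}^k \alpha_j = n$. Then,

$$-\sum_{j=1}^k \frac{\alpha_j}{n} \log \frac{\alpha_j}{n} \ge \left(1 + \frac{2(k-2)}{n}\right) f\left(\frac{1}{n}\right).$$

See Appendix A for the proof of this lemma.
Lemma 6.2 implies that

$$H(Y_i) \ge \left(1 + \frac{2(k(i)-2)}{n}\right) f\left(\frac{1}{n}\right) \ge \left(1 + \frac{2}{n}\right) f\left(\frac{1}{n}\right).$$

We can now summarize these results as the following lemma:

Lemma 6.3. The coordinate-wise entropy of the spike process after being transformed by a basis in $GL(n, \mathbb{R})$ can be computed or bounded as follows:

(6.10) if $k(i) = 1$, then $H(Y_i) = 0$;

(6.11) if $k(i) = 2$, then $H(Y_i) \begin{cases} = f(1/n) & \text{if } \alpha_1(i) = 1; \\ \ge f(2/n) & \text{if } 2 \le \alpha_1(i) \le n/2; \end{cases}$

(6.12) if $k(i) \ge 3$, then $H(Y_i) \ge \left(1 + \frac{2(k(i)-2)}{n}\right) f\left(\frac{1}{n}\right) \ge \left(1 + \frac{2}{n}\right) f\left(\frac{1}{n}\right)$.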

Let us now come back to our invertible transformation $U$; we are searching for the LSDB among $O(n)$ or $GL(n, \mathbb{R})$. This means that the cost of the LSDB, i.e., the sum of the coordinate-wise entropy of the LSDB coordinates, cannot be larger than that of the standard basis. Therefore we will always keep the standard basis in mind as a reference basis with which we shall compare the performance of all other bases.

The standard basis corresponds to $U = I_n$. Every row of the standard basis has class $k(i) = 2$ and index $c(i) = (1, n-1)$. Hence the entropy cost of the standard basis is

(6.13) $C_H(I_n \mid X) = n \times f(1/n) = n \log n - (n-1) \log(n-1).$

We saw that, assuming $k(i) > 1$, $H(Y_i) \ge f(1/n)$, with equality if and only if $k(i) = 2$ and $c(i) = (1, n-1)$. Therefore a basis with $k(i) > 1$ for every $i \in \{1, \ldots, n\}$ has no chance to win over the standard basis, and the best thing one can do with such a basis is to match the entropy with that of the standard basis, i.e., a basis with $k(i) = 2$ and $c(i) = (1, n-1)$ for every $i$.

So, the only chance to beat the standard basis is to have some "class 1" rows (i.e., $k(i) = 1$) in a basis. However, we will never find an invertible matrix with more than one class 1 row. Indeed, a class 1 row is necessarily proportional to $\mathbf{1}_n = (1, 1, \ldots, 1)$, and two rows proportional to the same vector are linearly dependent, so more than one class 1 row cannot exist in any invertible matrix.
6.6 Proof of Theorem 5.2

Proof. Let us start with a simple remark. If we assume that $B$ is an orthonormal basis, then $U = B^{-1} = B^T$. Hence the rows of $U$ are in fact the basis vectors of this basis. In the case of an orthonormal matrix, the presence of one row of class 1 imposes a constraint on the other rows, since these rows must form an orthonormal basis. The following lemma describes one of these constraints.

Lemma 6.4. If $k(1) = 1$, then it is impossible to have two class 2 rows with index $(1, n-1)$ in a matrix $U \in O(n)$. In other words, if $k(1) = 1$, then there do not exist $i_1, i_2 \in \{1, \ldots, n\}$ such that $i_1 \ne i_2$ and $c(i_1) = c(i_2) = (1, n-1)$.

The proof of this lemma can be found in Appendix B.

Hence, assuming that $k(1) = 1$, we can have at most one row of class 2 with index $(1, n-1)$. All the other rows will be of either class $k(i) > 2$ or class $k(i) = 2$ with index $(\alpha_1, n - \alpha_1)$, $1 < \alpha_1 \le n/2$. Considering the minimization of the sum of the coordinate-wise entropy, we must have one row of class 1 and one row of class 2 with index $(1, n-1)$. All the other cases always increase the entropy, i.e., dependency. From (6.11) and (6.12), the entropy of a row with either $k(i) > 2$ or $k(i) = 2$ with index $(\alpha_1, n - \alpha_1)$, $1 < \alpha_1 \le n/2$, is bounded from below as

$$H(Y_i) \ge \min\left(\left(1 + \frac{2}{n}\right) f\left(\frac{1}{n}\right), f\left(\frac{2}{n}\right)\right).$$

Therefore, combining this with (6.10) for $k(1) = 1$ and (6.11) for $\alpha_1 = 1$, we have

(6.14) $\sum_{i=1}^n H(Y_i) \ge 0 + f\left(\frac{1}{n}\right) + (n-2) \min\left(\left(1 + \frac{2}{n}\right) f\left(\frac{1}{n}\right), f\left(\frac{2}{n}\right)\right).$

We now use the following lemma:

Lemma 6.5. For $n \ge 6$,

$$\left(1 + \frac{2}{n}\right) f\left(\frac{1}{n}\right) < f\left(\frac{2}{n}\right).$$

Proof. Let us define a function

$$r(x) = x \left[\left(1 + \frac{2}{x}\right) f\left(\frac{1}{x}\right) - f\left(\frac{2}{x}\right)\right]$$

for $x > 2$, where $f$ is defined in (6.6). This is a continuous and monotonically decreasing function for $x > 2$, since

$$r'(x) = -\frac{2}{x^2} \log(x-1) + \log \frac{x-2}{x-1} < 0 \quad \text{for } x > 2.$$

Moreover, we have $r(5) \approx 0.199$ and $r(6) \approx -0.310$, and we can find a zero of $r(x)$ numerically, i.e., $r(x^*) = 0$ where $x^* \approx 5.3623$. These prove that this function is negative if $x > x^*$. Therefore, for each integer $n \ge 6$, $r(n) < 0$, i.e.,

$$\left(1 + \frac{2}{n}\right) f\left(\frac{1}{n}\right) < f\left(\frac{2}{n}\right).$$

□
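The numerical values quoted in the proof of Lemma 6.5 can be reproduced with a few lines of code. The sketch below assumes that $r$ has the form $r(x) = x[(1 + 2/x) f(1/x) - f(2/x)]$ with base-2 logarithms; this form is a reading of the damaged formula, chosen because it reproduces $r(5) \approx 0.199$ and $r(6) \approx -0.310$. The helper name `bisect_root` is hypothetical.

```python
import math


def f(x):
    """Binary entropy (6.6) in bits, with the convention 0 log 0 = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -(x * math.log2(x) + (1.0 - x) * math.log2(1.0 - x))


def r(x):
    """Assumed form of r: x * [(1 + 2/x) f(1/x) - f(2/x)], for x > 2."""
    return x * ((1.0 + 2.0 / x) * f(1.0 / x) - f(2.0 / x))


def bisect_root(func, lo, hi, tol=1e-12):
    """Plain bisection; assumes func changes sign on [lo, hi]."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x_star = bisect_root(r, 5.0, 6.0)

# Sign of r(n) decides which term attains the minimum in the row-entropy bound.
signs = {n: r(n) < 0 for n in range(3, 21)}
```

With this form, `signs[n]` is `False` for $n = 3, 4, 5$ and `True` from $n = 6$ on, which is the dichotomy between the general case and the special cases treated separately.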



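Using this lemma for $n \ge 6$, (6.14) can be written as

$$\sum_{i=1}^n H(Y_i) \ge f\left(\frac{1}{n}\right) + (n-2) \left[\frac{2}{n} f\left(\frac{1}{n}\right) + f\left(\frac{1}{n}\right)\right] = \left[\frac{2(n-2)}{n} + n - 1\right] f\left(\frac{1}{n}\right).$$

Therefore, if we compare the mutual information of the new coordinates to that of the standard basis, whose total coordinate-wise entropy is $n f(1/n)$ by (6.13), we have

$$I(Y) - I(X) \ge \left[\frac{2(n-2)}{n} - 1\right] f\left(\frac{1}{n}\right).$$

That is,

$$I(Y) - I(X) \ge \frac{n-4}{n} f\left(\frac{1}{n}\right) > 0.$$

Thus, $B = U^{-1} = U^T$ is not the LSDB. We have therefore proved that any orthonormal basis containing a class 1 row yields a larger mutual information than the standard basis for the spike process for $n \ge 6$.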

We can summarize our results so far.

• For $n \ge 6$, the standard basis is the LSDB among $O(n)$.

• Any basis that yields the same mutual information as the standard basis necessarily consists of only class 2 rows with index $(1, n-1)$.

Now the question is whether there is any other basis except the standard basis satisfying this condition. The following lemma concludes the proof of Theorem 5.2 for $n \ge 6$.

Lemma 6.6. For $n \ge 2$, an orthonormal basis consisting of class 2 rows with index $(1, n-1)$ other than the standard basis is uniquely (modulo permutations and sign flips as described in Remark 2) determined as (5.1), i.e.,

$$B_{O(n)} = \frac{1}{n} \begin{pmatrix} n-2 & -2 & \cdots & -2 \\ -2 & n-2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & -2 \\ -2 & \cdots & -2 & n-2 \end{pmatrix}.$$

The proof of this lemma can be found in Appendix C. Note that this matrix becomes a permuted and sign-flipped version of $I_2$ when $n = 2$, and approaches the identity matrix as $n \to \infty$.
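The properties claimed for the Householder-type basis can be verified numerically. The sketch below assumes that (5.1) is the matrix $I_n - (2/n)\mathbf{1}_n\mathbf{1}_n^T$, i.e., diagonal entries $(n-2)/n$ and off-diagonal entries $-2/n$; the function names are hypothetical. It checks orthonormality, the index of every row, and the equality of the total coordinate-wise entropy with that of the standard basis.

```python
import math
from collections import Counter

import numpy as np


def householder_basis(n):
    """I_n - (2/n) 1 1^T: diagonal entries (n-2)/n, off-diagonal entries -2/n."""
    return np.eye(n) - (2.0 / n) * np.ones((n, n))


def row_index(row, ndigits=12):
    """Index of a row: sorted occurrence counts of its distinct values."""
    counts = Counter(round(float(u), ndigits) for u in row)
    return tuple(sorted(counts.values()))


def total_entropy(U):
    """Sum of coordinate-wise entropies (bits) of Y = U X for the spike process."""
    n = U.shape[1]
    total = 0.0
    for row in U:
        total -= sum(a / n * math.log2(a / n) for a in row_index(row))
    return total


report = {}
for n in (3, 5, 8):
    B = householder_basis(n)
    report[n] = (
        bool(np.allclose(B @ B.T, np.eye(n))),           # orthonormal
        all(row_index(row) == (1, n - 1) for row in B),  # class 2, index (1, n-1)
        total_entropy(B) - total_entropy(np.eye(n)),     # entropy gap to standard basis
    )
```

Since this matrix is symmetric, its rows are also the rows of $U = B^T$, so the row statistics apply directly to the analysis transform.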

We now prove the particular cases $n = 3, 4, 5$ in Theorem 5.2. For these small values of $n$, we cannot use Lemma 6.5 anymore since we have

$$\left(1 + \frac{2}{n}\right) f\left(\frac{1}{n}\right) > f\left(\frac{2}{n}\right).$$

Therefore, we prove these cases by examining exhaustively all possible indexes and the coordinate-wise entropy they generate.
$n = 3$: The possible indexes are $(3)$, $(1, 2)$ and $(1, 1, 1)$, which generate the following entropy values (in bits):

$(3)$: $H(Y_i) = 0$;
$(1, 2)$: $H(Y_i) = f(1/3) = -\frac{1}{3} \log \frac{1}{3} - \frac{2}{3} \log \frac{2}{3} = \log 3 - \frac{2}{3}$;
$(1, 1, 1)$: $H(Y_i) = 3 \times \left(-\frac{1}{3} \log \frac{1}{3}\right) = \log 3$.

Once again, the only possibility for a basis to generate lower entropy than the standard basis is to include a class 1 row with index $(3)$. But here we still cannot have two class 2 rows of index $(1, 2)$ on top of the class 1 row since Lemma 6.4 still holds for $n = 3$. Therefore, the best combination is to have one row of each possible class, which leads to the following global coordinate-wise entropy:

$$0 + \log 3 - \frac{2}{3} + \log 3 \approx 2.50 < 3 \log 3 - 2 \log 2 \approx 2.75,$$

that is, this best possible basis is better than the standard basis. Therefore, the LSDB is a basis including a vector of each class. Considering the orthonormality of the basis, we can only have the following basis or its permuted or sign-flipped versions for $n = 3$:

$$U^{-1} = B = \begin{pmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{6}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{3}} & \frac{-2}{\sqrt{6}} & 0 \end{pmatrix}.$$



$n = 4$: The possible indexes are $(4)$, $(1, 3)$, $(2, 2)$, $(1, 1, 2)$, and $(1, 1, 1, 1)$, which generate the following entropy values (in bits):

$(4)$: $H(Y_i) = 0$;
$(1, 3)$: $H(Y_i) = f(1/4) = -\frac{1}{4} \log \frac{1}{4} - \frac{3}{4} \log \frac{3}{4} = 2 - \frac{3}{4} \log 3 \approx 0.811$;
$(2, 2)$: $H(Y_i) = f(1/2) = 1$;
$(1, 1, 2)$: $H(Y_i) = -\frac{1}{4} \log \frac{1}{4} - \frac{1}{4} \log \frac{1}{4} - \frac{1}{2} \log \frac{1}{2} = 1.5$;
$(1, 1, 1, 1)$: $H(Y_i) = 4 \times \left(-\frac{1}{4} \log \frac{1}{4}\right) = 2$.

The total coordinate-wise entropy of the Walsh basis is $H_W = 0 + 3 \times f(1/2) = 3$ bits. We know from Theorem 5.1 that $H_W$ is smaller than that of the standard basis.

Let $U$ be an orthonormal basis, and let $\{b_i \mid i = 1, \ldots, 4\}$ be its rows. If $U$ generates smaller entropy than the Walsh basis, it necessarily includes one class 1 row and one class 2 row with index $(1, 3)$, from the same argument as the proof of Lemma 6.4 (see Appendix B). Let us assume that $b_1$ is of class 1 and $b_2$ of class 2 with index $(1, 3)$. In other words, $c(1) = (4)$ and $c(2) = (1, 3)$. Now, $b_2$ is of the form $(a, a, a, b)$ and orthogonality with $b_1$ implies that $b_2$ is proportional to the vector $(1, 1, 1, -3)$. Now, $U$ cannot include a class 4 row vector of index $(1, 1, 1, 1)$. If so, these three rows (i.e., rows of class 1, 2, and 4) generate the entropy $0 + 0.811 + 2 = 2.811$ bits. Hence, any other admissible choice for the remaining row, i.e., a class 2 row with index $(1, 3)$, which generates 0.811 bits, or a class 2 row with index $(2, 2)$, which generates 1 bit, or a class 3 row with index $(1, 1, 2)$, which generates 1.5 bits, ends up with a larger total coordinate-wise entropy than the Walsh basis. Therefore we can discard these combinations immediately, and the indexes of $b_3$ and $b_4$ must be chosen from $(2, 2)$ and $(1, 1, 2)$. If $b_3$ is of index $(2, 2)$, it is of the form $(a, a, b, b)$ and orthogonality with $b_1$ implies that $b_3$ is proportional to $(a, a, -a, -a)$. Then, orthogonality with $b_2$ implies $a + a - a + 3a = 0$, i.e., $a = 0$. Therefore the only possibility for $b_3$ and $b_4$ is to be both of index $(1, 1, 2)$, each of which generates the coordinate-wise entropy 1.5 bits. The total coordinate-wise entropy generated by $U$ is therefore at least $0 + 0.811 + 2 \times 1.5 = 3.811 > 3 = H_W$, hence $U^T$ is not the LSDB. We can now conclude that the LSDB among $O(4)$ is the Walsh basis.

$n = 5$: In this case, we prove that the LSDB is the standard basis or the basis of the Householder reflection (5.1), both of which consist of class 2 rows with index $(1, 4)$ only. Indeed, using a similar argument as before, any basis generating smaller entropy than these two bases must have a class 1 row and a class 2 row with index $(1, 4)$. However, since the other three rows must be either of class 2 with different indexes or of class 3 or higher, the total entropy of such a basis is larger than that of the standard basis or the Householder reflection basis:

$$\sum_{i=1}^5 H(Y_i) \ge 0 + f\left(\frac{1}{5}\right) + 3 \times f\left(\frac{2}{5}\right) \approx 3.635 > 5 \times f\left(\frac{1}{5}\right) \approx 3.609.$$

This concludes the proof of Theorem 5.2. □ 

6.7 Proof of Theorem 5.3 

Proof. For the case $\mathcal{D} = GL(n, \mathbb{R})$, the constraint imposed by Lemma 6.4 is lifted since the rows of $U = B^{-1}$ do not have to form an orthonormal basis anymore. Hence we can have as many rows of class 2 with index $(1, n-1)$ as we wish, even if the first row of $U$ is of class 1. Clearly, we still cannot have two class 1 rows because this violates the invertibility of $U$. Therefore, considering all these remarks and the classification of indexes established in the previous subsections, it is immediate to conclude that the combination of classes of rows leading to the smallest sum of coordinate-wise entropy is one row of class 1 and $n-1$ rows of class 2 with index $(1, n-1)$. This matrix reaches the lower bound for the total coordinate-wise entropy $(n-1) f(1/n)$. Considering the invertibility of the matrix with $n-1$ rows of class 2, the most general form of the admissible matrices is the following (modulo permutations and sign-flips mentioned in (5.2)):

$$U_{GL(n,\mathbb{R})} = B_{GL(n,\mathbb{R})}^{-1} = \begin{pmatrix} a & a & \cdots & \cdots & a \\ b_2 & c_2 & b_2 & \cdots & b_2 \\ b_3 & b_3 & c_3 & \cdots & b_3 \\ \vdots & \vdots & & \ddots & \vdots \\ b_{n-1} & \cdots & b_{n-1} & c_{n-1} & b_{n-1} \\ b_n & \cdots & \cdots & b_n & c_n \end{pmatrix}$$

where $a$, $b_k$, $c_k$, $k = 2, \ldots, n$, must be chosen so that $U_{GL(n,\mathbb{R})} \in GL(n, \mathbb{R})$. We can easily compute the determinant of this matrix in a similar manner that we derived (6.1):

$$\det\left(U_{GL(n,\mathbb{R})}\right) = a \prod_{k=2}^n (c_k - b_k).$$

Therefore, we must have $a \ne 0$ and $b_k \ne c_k$ for $k = 2, \ldots, n$ for $U_{GL(n,\mathbb{R})}$ to be in $GL(n, \mathbb{R})$. Note that if we want to restrict the dictionary to $SL^{\pm}(n, \mathbb{R})$, then we must have $\det\left(U_{SL^{\pm}(n,\mathbb{R})}\right) = \pm 1$, i.e., $a$ must satisfy $a = \pm \prod_{k=2}^n (c_k - b_k)^{-1}$.

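The corresponding inverse matrix (5.3) can be computed easily by elementary linear algebra, i.e., the Gauss-Jordan method. We show this matrix here again:

$$B_{GL(n,\mathbb{R})} = \begin{pmatrix} \left(1 + \sum_{k=2}^n b_k d_k\right)/a & -d_2 & -d_3 & \cdots & -d_n \\ -b_2 d_2/a & d_2 & 0 & \cdots & 0 \\ -b_3 d_3/a & 0 & d_3 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & 0 \\ -b_n d_n/a & 0 & \cdots & 0 & d_n \end{pmatrix},$$

where $d_k = 1/(c_k - b_k)$, $k = 2, \ldots, n$. These are the LSDB pairs (analysis and synthesis respectively). This concludes the proof of Theorem 5.3. □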
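The analysis/synthesis pair can be checked numerically for concrete parameter values. The sketch below is illustrative only: the function names and the particular values of $a$, $b_k$, $c_k$ are arbitrary choices, and the matrices follow the layout described above (first row constant, row $k$ equal to $b_k$ everywhere except $c_k$ on the diagonal).

```python
import numpy as np


def analysis_matrix(a, b, c):
    """U: first row (a, ..., a); row k has c_k on the diagonal, b_k elsewhere."""
    b, c = np.asarray(b, dtype=float), np.asarray(c, dtype=float)
    n = len(b) + 1
    U = np.empty((n, n))
    U[0, :] = a
    U[1:, :] = b[:, None]
    U[np.arange(1, n), np.arange(1, n)] = c
    return U


def synthesis_matrix(a, b, c):
    """Closed-form inverse of U, with d_k = 1 / (c_k - b_k)."""
    b, c = np.asarray(b, dtype=float), np.asarray(c, dtype=float)
    d = 1.0 / (c - b)
    n = len(b) + 1
    B = np.zeros((n, n))
    B[0, 0] = (1.0 + np.sum(b * d)) / a
    B[0, 1:] = -d
    B[1:, 0] = -b * d / a
    B[np.arange(1, n), np.arange(1, n)] = d
    return B


# Arbitrary admissible parameters: a != 0 and b_k != c_k (n = 4 here).
a, b, c = 2.0, [0.5, -1.0, 3.0], [1.5, 2.0, -1.0]
U = analysis_matrix(a, b, c)
B = synthesis_matrix(a, b, c)

product_is_identity = bool(np.allclose(U @ B, np.eye(4)))
det_formula = a * float(np.prod(np.asarray(c) - np.asarray(b)))
det_numeric = float(np.linalg.det(U))
```

Rescaling $a$ so that `det_formula` equals $\pm 1$ gives a member of $SL^{\pm}(n, \mathbb{R})$ without changing the class or index of any row.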

6.8 Proof of Proposition 5.3 

If we transform the spike process $X$ by the Householder reflector $B_{O(n)}$ (5.1), the number of nonzero components of $Y = B_{O(n)}^T X$ can be easily computed as

$$C_0(B_{O(n)} \mid X) = E\|Y\|_0 = n.$$

Now, let us consider the case $0 < p \le 1$. Since $n > 2$, we have

$$C_p(B_{O(n)} \mid X) = E\|Y\|_p^p = \left(1 - \frac{2}{n}\right)^p + (n-1) \left(\frac{2}{n}\right)^p.$$

Let us now define the following function:

$$s_p(x) = (1-x)^p + \left(\frac{2}{x} - 1\right) x^p = (1-x)^p - x^p + \frac{2}{x^{1-p}},$$

where $0 < x = 2/n < 1$. Taking the derivative with respect to $x$, we have

$$s_p'(x) = -p \left(\frac{1}{(1-x)^{1-p}} + \frac{1}{x^{1-p}}\right) + \frac{2(p-1)}{x^{2-p}} < 0$$

for $0 < x < 1$ and $0 < p \le 1$. Therefore, in this interval, $s_p(x)$ is monotonically decreasing, and the decisive term for the sparsity measure $C_p$ is $2/x^{1-p}$. Therefore, we have

$$\lim_{n \to \infty} C_p(B_{O(n)} \mid X) = \lim_{x \downarrow 0} s_p(x) = \infty \quad \text{for } 0 < p < 1.$$

If $p = 1$, then $s_1(x) = (1-x) - x + 2 = 3 - 2x$. Hence, we have

$$\lim_{n \to \infty} C_1(B_{O(n)} \mid X) = \lim_{x \downarrow 0} s_1(x) = 3.$$

This completes the proof. □
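The closed form of the sparsity cost can be compared with a direct evaluation over all realizations of the spike process. The sketch below assumes that the Householder reflector (5.1) is $I_n - (2/n)\mathbf{1}_n\mathbf{1}_n^T$; the function names `expected_lp`, `closed_form`, and `s_p` are hypothetical.

```python
import numpy as np


def householder_basis(n):
    """I_n - (2/n) 1 1^T: diagonal entries (n-2)/n, off-diagonal entries -2/n."""
    return np.eye(n) - (2.0 / n) * np.ones((n, n))


def expected_lp(B, p):
    """E ||B^T X||_p^p: average over the n equiprobable spikes e_j of sum_i |B_ji|^p."""
    n = B.shape[0]
    return float(np.sum(np.abs(B) ** p)) / n


def closed_form(n, p):
    """(1 - 2/n)^p + (n - 1) (2/n)^p."""
    return (1.0 - 2.0 / n) ** p + (n - 1) * (2.0 / n) ** p


def s_p(x, p):
    """(1 - x)^p - x^p + 2 / x^(1 - p), evaluated at x = 2/n."""
    return (1.0 - x) ** p - x ** p + 2.0 * x ** (p - 1.0)


table = {
    (n, p): (expected_lp(householder_basis(n), p), closed_form(n, p), s_p(2.0 / n, p))
    for n in (4, 16, 256)
    for p in (0.25, 0.5, 1.0)
}
```

For comparison, the standard basis gives `expected_lp(np.eye(n), p) == 1.0` for every $p > 0$, whereas the values in `table` grow with $n$ when $p < 1$ and stay below 3 when $p = 1$.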

6.9 Proof of Corollary 5.1 

Proof. We now consider the mutual information of the spike process under the LSDB pair (5.2) and (5.3) in Theorem 5.3, which was proved in Subsection 6.7. Using this analysis LSDB, the mutual information of $Y = B_{GL(n,\mathbb{R})}^{-1} X$ is

$$I(Y) = -H(X) + \sum_{i=1}^n H(Y_i) = -\log n + (n-1) f\left(\frac{1}{n}\right) = -\log n + (n-1) \left[\log n - \frac{n-1}{n} \log(n-1)\right]$$

(6.15) $= (n-2) \log n - \frac{(n-1)^2}{n} \log(n-1).$

Let $h(n)$ denote the last expression in (6.15). Note that $h(2) = 0$, i.e., we can achieve the true independence for $n = 2$. If $n > 2$, this function is strictly positive and monotonically increasing as shown in Figures 8 and 9. By expanding the natural logarithm version of $h(x)$, we have

$$\begin{aligned} \ln 2 \times h(x) &= (x-2) \ln x - \frac{(x-1)^2}{x} \ln(x-1) \\ &= (x-2) \ln x - \left(x - 2 + \frac{1}{x}\right) \left(\ln x + \ln\left(1 - \frac{1}{x}\right)\right) \\ &= (x-2) \ln x - \left(x - 2 + \frac{1}{x}\right) \ln x + \left(x - 2 + \frac{1}{x}\right) \left(\frac{1}{x} + \frac{1}{2x^2} + o\left(\frac{1}{x^2}\right)\right) \\ &= 1 - \frac{\ln x}{x} - \frac{3}{2x} + o\left(\frac{1}{x}\right). \end{aligned}$$
Fig. 8. A plot of the function $\ln 2 \times h : x \mapsto (x-2) \ln x - \frac{(x-1)^2}{x} \ln(x-1)$.

Fig. 9. A plot of the function $\ln 2 \times h(x)$ for large $x$.

In other words, we have established

$$I(Y) \sim \frac{1}{\ln 2} \left(1 - \frac{\ln n}{n}\right) \quad \text{as } n \to \infty.$$

Consequently,

$$\lim_{n \to \infty} I\left(B_{GL(n,\mathbb{R})}^{-1} X\right) = \frac{1}{\ln 2} = \log e \approx 1.4427.$$

Therefore, for $n > 2$, there is no invertible linear transformation that gives truly independent coordinates for the spike process.

As for the orthonormal case, using (6.13), we have

$$I\left(B_{O(n)}^T X\right) = n \log n - (n-1) \log(n-1) - \log n = (n-1) \log \frac{n}{n-1} = \log \left(1 + \frac{1}{n-1}\right)^{n-1}.$$

Now, it is easy to see

$$\lim_{n \to \infty} I\left(B_{O(n)}^T X\right) = \log e.$$

This completes the proof of Corollary 5.1. □
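The behaviour of the two mutual-information expressions is easy to tabulate. The following sketch evaluates (6.15) and the orthonormal-case formula in bits and compares them with $\log e$; the function names `mi_gl` and `mi_orthonormal` are hypothetical.

```python
import math


def f(x):
    """Binary entropy (6.6) in bits, with the convention 0 log 0 = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -(x * math.log2(x) + (1.0 - x) * math.log2(1.0 - x))


def mi_gl(n):
    """h(n) of (6.15): (n - 2) log n - ((n - 1)^2 / n) log(n - 1), in bits."""
    return (n - 2) * math.log2(n) - (n - 1) ** 2 / n * math.log2(n - 1)


def mi_orthonormal(n):
    """(n - 1) log(n / (n - 1)): standard basis or Householder basis, in bits."""
    return (n - 1) * math.log2(n / (n - 1))


LOG2_E = math.log2(math.e)  # about 1.4427

values_gl = [mi_gl(n) for n in range(2, 200)]
gap_gl = LOG2_E - mi_gl(10 ** 6)
gap_orthonormal = LOG2_E - mi_orthonormal(10 ** 6)
```

Both gaps are positive and shrink as $n$ grows; the $GL(n, \mathbb{R})$ value approaches the limit more slowly because of the $(\ln n)/n$ term in the expansion.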

7. Discussion 



In general, sparsity and statistical independence are two completely different concepts
as adaptive basis selection criteria, as demonstrated by the rotations of the 2D
uniform distribution in Section 4. For the spike process, however, we showed that the BSB
and the LSDB can coincide (i.e., the standard basis) if we restrict our basis search within
O(n) with n > 5. However, we also showed that the standard basis is not the only LSDB
in this case. To our surprise, there exists another orthonormal basis (5.1) representing the
Householder reflector, which attains exactly the same level of statistical dependence
as the standard basis, if evaluated by the mutual information or equivalently by the
total coordinate-wise entropy $C_H$ defined in (3.3). Yet this LSDB does not sparsify the
process at all if we measure the sparsity by the expected $\ell^p$ norm $C_p$ defined in (3.1), where
$0 < p < 1$. It is also interesting to note that this Householder reflector approaches the
standard basis as $n \to \infty$. Furthermore, if we extend our basis search to GL(n, R), then
the LSDB and the BSB cannot coincide.
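
The contrast between the two orthonormal LSDBs can be illustrated numerically. The sketch below makes two assumptions that are not spelled out in this section: the sparsity cost is taken to be $E\sum_i |Y_i|^p$ (one common reading of the expected $\ell^p$ norm of (3.1)), and the reflector is taken as $I - (2/n)\mathbf{1}\mathbf{1}^T$, i.e., the matrix with diagonal entries $(n-2)/n$ and off-diagonal entries $-2/n$ derived in Appendix C. All function names are illustrative.

```python
import math
from collections import Counter


def identity(n):
    """Standard basis as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def householder(n):
    """Rows of I - (2/n) * ones * ones^T."""
    return [[(1.0 if i == j else 0.0) - 2.0 / n for j in range(n)] for i in range(n)]


def coordinate_entropy_sum(U):
    """Total coordinate-wise entropy (bits) of Y = U e_J with J uniform on the n positions."""
    n = len(U)
    total = 0.0
    for row in U:
        # Y_i equals U[i][j] when the spike sits at position j.
        counts = Counter(round(v, 12) for v in row)
        total -= sum(c / n * math.log2(c / n) for c in counts.values())
    return total


def expected_lp_cost(U, p):
    """E sum_i |Y_i|^p over the n equally likely spike positions (assumed form of C_p)."""
    n = len(U)
    return sum(abs(U[i][j]) ** p for i in range(n) for j in range(n)) / n


if __name__ == "__main__":
    n, p = 8, 0.5
    for name, U in (("standard", identity(n)), ("Householder", householder(n))):
        print(name, coordinate_entropy_sum(U), expected_lp_cost(U, p))
```

Both bases give the same total coordinate-wise entropy $n f(1/n)$, while the $\ell^p$ cost of the reflector is several times that of the standard basis.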

What do these observations and the effort to prove these theorems suggest? First,
it is clear that proving theorems on the LSDB and computing it for more complicated
stochastic processes would be much more difficult than for the BSB. To deal with statistical
dependency, we need to consider the "stochastics" explicitly, such as entropy and the pdf
of each coordinate. On the other hand, sparsity does not require such information. In fact,
one can even adapt the BSB for each realization rather than for the whole set of realizations;
see Saito et al. (2000, 2001) for further information about this issue.

Second, Remark 4 and Proposition 5.3 raise questions about the appropriateness of the
$\ell^p$ norm ($0 < p < 1$) (3.1) as a sparsity measure. According to this measure, the bases
(5.1) and (5.4) provide completely "dense" coordinates for the spike process. Yet, if we
look at these basis vectors carefully, they are very "simple" in the sense that at most one
component differs from all the other common components in each basis vector. In other
words, the sparsity measured by the $\ell^p$ norm does not imply the simplicity measured by
the entropy, and vice versa. Therefore, if a given problem really requires the statistical
independence criterion, then we cannot replace it by the sparsity criterion in general.

Why, then, were the sparse basis of Olshausen and Field and the ICA basis of Bell and
Sejnowski more or less the same? Our interpretation of this phenomenon is the
following (see also Remark 5). The Gabor-like functions they obtained essentially convert
an input image patch to a spike or spike-like image. In our opinion, the image patch size,
such as 16 × 16 pixels, was crucial in their experiments. Since those image patches are
of small size, they tend to have simpler image contents such as simple oriented edges.
It seems to us that if their algorithms were computationally feasible for image
patches of larger size such as 64 × 64 or 128 × 128, both the BSB and the LSDB would
be very different from Gabor-like functions. These large image patches (due to rich
scene variations and contents in patches of these sizes) cannot be converted to spikes
by Gabor-like simple functions.

We also note that the LSDB is not guaranteed to provide truly statistically independent
coordinates in general. Therefore, if our interest is data compression, it seems
to us that the pursuit of sparse representations should be encouraged rather than that of
statistically independent representations. This is also the viewpoint indicated by Donoho
(1998). However, this is not meant to downgrade the importance of statistical
independence altogether. If we want to separate mixed signals or to build empirical models of
stochastic processes for simulation purposes, then pursuing statistical independence
should be encouraged, and we expect to see further interplay between these two criteria.

Finally, there are a few interesting generalizations of the spike process, which need
to be addressed in the future. One is the spike process with varying amplitude. The
spike process whose amplitude obeys the normal distribution was treated by Donoho et
al. (1998) to demonstrate the superiority of the non-Gaussian coding using spike location
information over the Gaussian-KLB coding. The other generalization is to randomly
throw multiple spikes into a single realization. If one throws more and more spikes
into one realization, the standard basis gets worse in terms of sparsity. It will be an
interesting exercise to consider the BSB and the LSDB for such situations.

Appendix A: Proof of Lemma 6.2



Proof. First we need to show another lemma as follows: 

Lemma A.1. Let $p_2 \ge p_1 \ge 1$ be positive integers such that $p_1 + p_2 < n$. Then

$$
\frac{p_1}{n}\log\frac{p_1}{n} + \frac{p_2}{n}\log\frac{p_2}{n} < \frac{p_1+p_2}{n}\log\frac{p_1+p_2}{n} - \frac{2}{n} f\!\left(\frac{1}{n}\right),
$$

where $f$ is defined in (6.6).
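Proof. The left-hand side of the inequality can be written as

$$
\begin{aligned}
\frac{p_1}{n}\log\frac{p_1}{n} + \frac{p_2}{n}\log\frac{p_2}{n}
  &= \frac{p_1+p_2}{n}\left[\frac{p_1}{p_1+p_2}\log\frac{p_1}{n} + \frac{p_2}{p_1+p_2}\log\frac{p_2}{n}\right] \\
  &= \frac{p_1+p_2}{n}\left[\log\frac{p_1+p_2}{n} + \frac{p_1}{p_1+p_2}\log\frac{p_1}{p_1+p_2} + \frac{p_2}{p_1+p_2}\log\frac{p_2}{p_1+p_2}\right] \\
  &= \frac{p_1+p_2}{n}\left[\log\frac{p_1+p_2}{n} - f\!\left(\frac{p_1}{p_1+p_2}\right)\right].
\end{aligned}
\tag{A.1}
$$

However, it is clear that

$$
\frac{1}{2} \ge \frac{p_1}{p_1+p_2} \ge \frac{1}{p_1+p_2} > \frac{1}{n}.
$$

From the monotonicity of $f(x)$ for $x \in [0, 1/2]$, we deduce

$$
1 = f\!\left(\frac{1}{2}\right) \ge f\!\left(\frac{p_1}{p_1+p_2}\right) \ge f\!\left(\frac{1}{p_1+p_2}\right) > f\!\left(\frac{1}{n}\right),
$$

which we can rewrite as

$$
-1 \le -f\!\left(\frac{p_1}{p_1+p_2}\right) \le -f\!\left(\frac{1}{p_1+p_2}\right) < -f\!\left(\frac{1}{n}\right).
$$

This inequality, the nonnegativity of $f$, and the assumption of this lemma (which gives $p_1 + p_2 \ge 2$) yield

$$
-\frac{p_1+p_2}{n}\, f\!\left(\frac{p_1}{p_1+p_2}\right) < -\frac{2}{n}\, f\!\left(\frac{1}{n}\right).
$$

This inequality combined with (A.1) completes the proof of Lemma A.1.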
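
As a numerical sanity check of Lemma A.1 and of the identity (A.1), the following sketch enumerates all admissible triples $(p_1, p_2, n)$ up to a chosen bound. It assumes that $f$ of (6.6) is the binary entropy in bits; the helper names are hypothetical and the check is no substitute for the proof above.

```python
import math


def f(x: float) -> float:
    """Binary entropy in bits; assumed here to be the f of (6.6)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def xlog2x(x: float) -> float:
    """x log2 x for x > 0."""
    return x * math.log2(x)


def lemma_a1_sides(p1: int, p2: int, n: int):
    """Return (lhs, rhs) of the inequality stated in Lemma A.1."""
    lhs = xlog2x(p1 / n) + xlog2x(p2 / n)
    rhs = xlog2x((p1 + p2) / n) - (2.0 / n) * f(1.0 / n)
    return lhs, rhs


def a1_identity_rhs(p1: int, p2: int, n: int) -> float:
    """Last line of (A.1): s * (log2 s - f(p1 / (p1 + p2))) with s = (p1 + p2) / n."""
    s = (p1 + p2) / n
    return s * (math.log2(s) - f(p1 / (p1 + p2)))


def admissible_triples(n_max: int):
    """All (p1, p2, n) with 1 <= p1 <= p2 and p1 + p2 < n <= n_max."""
    for n in range(3, n_max + 1):
        for p1 in range(1, n):
            for p2 in range(p1, n - p1):
                yield p1, p2, n


if __name__ == "__main__":
    worst = min(rhs - lhs for lhs, rhs in
                (lemma_a1_sides(*t) for t in admissible_triples(40)))
    print("smallest gap rhs - lhs for n <= 40:", worst)
```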

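Coming back to the proof of Lemma 6.2, we now use induction as follows.

k = 3: Since $\alpha_1 + \alpha_2 < n$, we can use Lemma A.1 to assert

$$
\frac{\alpha_1}{n}\log\frac{\alpha_1}{n} + \frac{\alpha_2}{n}\log\frac{\alpha_2}{n} < \frac{\alpha_1+\alpha_2}{n}\log\frac{\alpha_1+\alpha_2}{n} - \frac{2}{n} f\!\left(\frac{1}{n}\right).
$$

Therefore,

$$
\begin{aligned}
\sum_{j=1}^{3} \frac{\alpha_j}{n}\log\frac{\alpha_j}{n}
  &< \frac{\alpha_3}{n}\log\frac{\alpha_3}{n} + \frac{\alpha_1+\alpha_2}{n}\log\frac{\alpha_1+\alpha_2}{n} - \frac{2}{n} f\!\left(\frac{1}{n}\right) \\
  &= \frac{\alpha_3}{n}\log\frac{\alpha_3}{n} + \left(1-\frac{\alpha_3}{n}\right)\log\left(1-\frac{\alpha_3}{n}\right) - \frac{2}{n} f\!\left(\frac{1}{n}\right) \\
  &= -f\!\left(\frac{\alpha_3}{n}\right) - \frac{2}{n} f\!\left(\frac{1}{n}\right).
\end{aligned}
$$

We used the fact $\sum_{j=1}^{3} \alpha_j = n$ to derive the equality in the second line of the above expression.

Since $\alpha_j \ge 1$ for $j = 1, 2, 3$, we must have $(n-1)/n > \alpha_3/n \ge 1/n$. Considering the symmetry of
$f(x)$ around $x = 1/2$ and its behavior, we can deduce that

$$
\sum_{j=1}^{3} \frac{\alpha_j}{n}\log\frac{\alpha_j}{n} < -f\!\left(\frac{\alpha_3}{n}\right) - \frac{2}{n} f\!\left(\frac{1}{n}\right) \le -\left(1+\frac{2}{n}\right) f\!\left(\frac{1}{n}\right).
$$

This nails down the case k = 3.

k ⇒ k + 1: Let us demonstrate that, assuming that the formula is true for $k \ge 3$, it is still true for $k+1$.
We can decompose the sum $\sum_{j=1}^{k+1} \frac{\alpha_j}{n}\log\frac{\alpha_j}{n}$ in the following way:

$$
\sum_{j=1}^{k+1} \frac{\alpha_j}{n}\log\frac{\alpha_j}{n} = \frac{\alpha_{k+1}}{n}\log\frac{\alpha_{k+1}}{n} + \frac{\alpha_k}{n}\log\frac{\alpha_k}{n} + \sum_{j=1}^{k-1} \frac{\alpha_j}{n}\log\frac{\alpha_j}{n}.
$$

But once again, since $\alpha_k + \alpha_{k+1} < n$, we can use Lemma A.1 to reach

$$
\frac{\alpha_{k+1}}{n}\log\frac{\alpha_{k+1}}{n} + \frac{\alpha_k}{n}\log\frac{\alpha_k}{n} < \frac{\alpha_k+\alpha_{k+1}}{n}\log\frac{\alpha_k+\alpha_{k+1}}{n} - \frac{2}{n} f\!\left(\frac{1}{n}\right).
\tag{A.2}
$$

Let us rename a sequence $\{\alpha_j\}$ as follows:

$$
\beta_j = \begin{cases} \alpha_{j+1} + \alpha_j & \text{if } j = k; \\ \alpha_j & \text{if } j = 1, \ldots, k-1. \end{cases}
$$

Then, using the induction assumption, we can rewrite (A.2) as

$$
\sum_{j=1}^{k+1} \frac{\alpha_j}{n}\log\frac{\alpha_j}{n} < \frac{\beta_k}{n}\log\frac{\beta_k}{n} + \sum_{j=1}^{k-1} \frac{\beta_j}{n}\log\frac{\beta_j}{n} - \frac{2}{n} f\!\left(\frac{1}{n}\right).
$$

Since $\sum_{j=1}^{k} \beta_j = \sum_{j=1}^{k+1} \alpha_j = n$, we can state that

$$
\begin{aligned}
\sum_{j=1}^{k+1} \frac{\alpha_j}{n}\log\frac{\alpha_j}{n}
  &< \sum_{j=1}^{k} \frac{\beta_j}{n}\log\frac{\beta_j}{n} - \frac{2}{n} f\!\left(\frac{1}{n}\right) \\
  &< -\left(1+\frac{2}{n}\right) f\!\left(\frac{1}{n}\right) - \frac{2}{n} f\!\left(\frac{1}{n}\right).
\end{aligned}
$$

This concludes the proof of Lemma 6.2. □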
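
The bound carried through the induction above, $\sum_j (\alpha_j/n)\log(\alpha_j/n) < -(1 + 2/n) f(1/n)$ for $k \ge 3$ positive integers $\alpha_j$ summing to $n$, can be checked by brute force for small $n$. This is a sketch under two assumptions: $f$ is the binary entropy in bits, and the bound is read off from the $k = 3$ step above, since the statement of Lemma 6.2 itself is not reproduced in this appendix. The helper names are illustrative.

```python
import math


def f(x: float) -> float:
    """Binary entropy in bits; assumed here to be the f of (6.6)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def partitions(n: int, max_part=None):
    """Yield the partitions of n as non-increasing tuples of positive integers."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def neg_entropy(alpha, n: int) -> float:
    """sum_j (alpha_j / n) log2(alpha_j / n)."""
    return sum(a / n * math.log2(a / n) for a in alpha)


def bound(n: int) -> float:
    """-(1 + 2/n) f(1/n), the bound reached in the k = 3 step."""
    return -(1.0 + 2.0 / n) * f(1.0 / n)


if __name__ == "__main__":
    for n in range(3, 19):
        gaps = [bound(n) - neg_entropy(a, n) for a in partitions(n) if len(a) >= 3]
        print(n, min(gaps))
```

The order of the $\alpha_j$ does not affect the sum, so enumerating partitions rather than ordered compositions loses nothing.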

Appendix B: Proof of Lemma 6.4 

Proof. Let us prove this lemma by reductio ad absurdum. Let us assume that, for example, $c(2) =
c(3) = (1, n-1)$. Since the first row of $U$ is proportional to $(1, 1, \ldots, 1)$, all the other rows must
satisfy $\sum_{j=1}^{n} u_{ij} = 0$ for $i = 2, \ldots, n$ because of the orthonormality condition. Let us now consider
the second row $(u_{21}, \ldots, u_{2n})$. Since $c(2) = (1, n-1)$, let us assume $u_{21} = a$ and $u_{2j} = b$, $j = 2, \ldots, n$,
for some $a, b \in \mathbb{R}$. Then the orthonormality condition implies $a + (n-1)b = 0$. Since the norm of
this row vector has to be one, we also have $a^2 + (n-1)b^2 = 1$. From these two constraints, we have
$(n-1)^2 b^2 + (n-1) b^2 = 1$. This implies $a = \pm\sqrt{\frac{n-1}{n}}$ and $b = \mp\frac{1}{\sqrt{n(n-1)}}$.

As the second and third rows of $U$ must be linearly independent, we need to assume that the third
row is $(c, d, c, \ldots, c)$ for some $c, d \in \mathbb{R}$. (We cannot assume $(d, c, \ldots, c)$ for the third row since its inner
product with the second row gives $ad + (n-1)bc = 0$, which leads to $c = d$ using the values of $a$ and
$b$ obtained above.) Then, similarly to the second row, we also get $d + (n-1)c = 0$, $d^2 + (n-1)c^2 = 1$.
Thus, we have $d = \pm a$ and $c = \pm b$. Then, regardless of the choice of the signs for $a, b, c, d$, the
orthogonality of the second and third rows yields

$$
0 = (n-2)b^2 + 2ab = (n-2)\cdot\frac{1}{n(n-1)} - 2\cdot\frac{1}{n}.
$$

This leads to $2 = \frac{n-2}{n-1}$, i.e., $2n - 2 = n - 2$, and finally $n = 0$. This contradiction implies that the
assumption made is impossible, and proves the lemma. □
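
The key computation of this proof is easy to reproduce numerically. The sketch below builds the two candidate rows with one particular choice of signs and confirms that they satisfy the zero-sum and unit-norm constraints, while their inner product equals $(n-2)/(n(n-1)) - 2/n = -1/(n-1)$, which is never zero. The function names are introduced here for illustration.

```python
import math


def candidate_rows(n: int):
    """Second and third rows of U under the assumption c(2) = c(3) = (1, n-1)."""
    a = math.sqrt((n - 1) / n)
    b = -1.0 / math.sqrt(n * (n - 1))
    row2 = [a] + [b] * (n - 1)          # (a, b, ..., b)
    row3 = [b, a] + [b] * (n - 2)       # (c, d, c, ..., c) with c = b, d = a
    return row2, row3


def dot(u, v) -> float:
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


if __name__ == "__main__":
    for n in (3, 5, 10, 50):
        row2, row3 = candidate_rows(n)
        print(n, sum(row2), dot(row2, row2), dot(row2, row3), -1.0 / (n - 1))
```

Flipping the sign of either row only flips the sign of the inner product, so orthogonality fails for every sign choice.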
Appendix C: Proof of Lemma 6.6



Proof. Our strategy of proving this lemma is the following. First we will show that the LSDB selected
from $O(n)$, which consists of only class 2 row vectors with index $(1, n-1)$, must be of the form:

$$
\begin{bmatrix}
a_1 & b_1 & \cdots & \cdots & b_1 \\
b_2 & a_2 & b_2 & \cdots & b_2 \\
\vdots & & \ddots & & \vdots \\
b_{n-1} & \cdots & b_{n-1} & a_{n-1} & b_{n-1} \\
b_n & \cdots & \cdots & b_n & a_n
\end{bmatrix}
\tag{C.1}
$$

where $a_k^2 + (n-1) b_k^2 = 1$ for $k = 1, \ldots, n$. We then derive the final form (5.1) using the orthonormality
of the row vectors of this matrix (C.1).

Since each row is of class 2 with index $(1, n-1)$, exactly one entry in each row must be different from
all the other $n-1$ entries. Therefore, without loss of generality, in the $k$th row, let $a_k$ be such a
distinguishing entry and $b_k$ be the other $n-1$ entries. Let $B = U^T$ be the LSDB under consideration.
Suppose $U$ has the $i$th and $j$th rows in which the locations of $a_i$ and $a_j$ coincide. Without loss of
generality (modulo row and column permutations), we can assume that $U$ is of the following form.

$$
\begin{bmatrix}
a_1 & b_1 & \cdots & \cdots & \cdots & b_1 \\
a_2 & b_2 & \cdots & \cdots & \cdots & b_2 \\
b_3 & a_3 & b_3 & \cdots & \cdots & b_3 \\
\vdots & & \ddots & & & \vdots \\
b_{n-1} & \cdots & \cdots & a_{n-1} & b_{n-1} & b_{n-1} \\
b_n & \cdots & \cdots & b_n & a_n & b_n
\end{bmatrix}
\tag{C.2}
$$

From the normalization condition, we must have:

$$
a_k^2 + (n-1) b_k^2 = 1 \quad \text{for } k = 1, \ldots, n.
\tag{C.3}
$$

From the orthonormality condition, $U^T U = I_n$, the diagonal entries of $U^T U$ are:

$$
\begin{aligned}
(U^T U)_{1,1} &= 1 = a_1^2 + a_2^2 + \sum_{j=3}^{n} b_j^2, \\
(U^T U)_{k,k} &= 1 = a_{k+1}^2 + \sum_{j \ne k+1} b_j^2, \quad 2 \le k < n, \\
(U^T U)_{n,n} &= 1 = \sum_{j=1}^{n} b_j^2.
\end{aligned}
$$

These imply that $a_k^2 = b_k^2$ for $k \ge 3$. Inserting this into (C.3) and noting that we must have $a_k \ne b_k$
because of the class 2 condition, we obtain:

$$
a_k = \pm 1/\sqrt{n}, \quad b_k = \mp 1/\sqrt{n}, \quad \text{for } k \ge 3.
\tag{C.4}
$$

Consider now the off-diagonal entries of $U^T U$, for example,

$$
\begin{aligned}
(U^T U)_{1,2} &= 0 = a_1 b_1 + a_2 b_2 + a_3 b_3 + b_4^2 + \cdots + b_n^2, \\
(U^T U)_{1,n} &= 0 = a_1 b_1 + a_2 b_2 + b_3^2 + b_4^2 + \cdots + b_n^2.
\end{aligned}
$$

Inserting (C.4) into these, we get

$$
\begin{aligned}
a_1 b_1 + a_2 b_2 - \frac{1}{n} + \frac{n-3}{n} &= 0, \\
a_1 b_1 + a_2 b_2 + \frac{n-2}{n} &= 0.
\end{aligned}
$$

This is a contradiction (i.e., $a_1 b_1 + a_2 b_2$ cannot have two different values). Therefore $U$ cannot have
two rows where the distinguishing entries $a_i$, $a_j$ share the same column index as in (C.2). It is clear
that we cannot have more than two such rows. Therefore, $U$ must be of the form (C.1).
Now, let us compute the entries of (C.1). The normalization condition (C.3) still holds. Computing
the diagonal entries of $U^T U = I_n$, we have

$$
(U^T U)_{k,k} = 1 = a_k^2 + \sum_{j \ne k} b_j^2 \quad \text{for } k = 1, \ldots, n.
\tag{C.5}
$$

Combining (C.3) and (C.5), we have:

$$
n b_k^2 = \sum_{j=1}^{n} b_j^2 \quad \text{for } k = 1, \ldots, n.
$$

This implies that $b_1^2 = \cdots = b_n^2$. Then, from the normalization condition (C.3), we must have
$a_1^2 = \cdots = a_n^2$ also. Consider now the off-diagonal entry of $U^T U$:

$$
(U^T U)_{1,2} = 0 = a_1 b_1 + a_2 b_2 + (n-2) b_1^2.
$$

Now, we must have $b_2 = b_1$ or $b_2 = -b_1$. So, the above equation can be written as

$$
(U^T U)_{1,2} = 0 = a_1 b_1 \pm a_2 b_1 + (n-2) b_1^2.
$$

This implies that either $b_1 = 0$ or $a_1 \pm a_2 + (n-2) b_1 = 0$. $b_1 = 0$ leads to $b_k = 0$ and $a_k = \pm 1$, i.e.,
the standard basis. Let us consider now the other case, i.e., $a_1 \pm a_2 + (n-2) b_1 = 0$. Since $a_2 = a_1$
or $a_2 = -a_1$, these lead to either $b_1 = 0$ or $2a_1 + (n-2) b_1 = 0$. The former case has been already
treated. Thus, let us proceed with the latter case. From this, we have

$$
a_1 = -\frac{n-2}{2}\, b_1.
\tag{C.6}
$$

Inserting this into (C.3), we have

$$
\frac{(n-2)^2}{4} b_1^2 + (n-1) b_1^2 = 1, \quad \text{i.e.,} \quad b_1^2 = \frac{4}{n^2}.
$$

Consequently,

$$
a_1 = \pm\frac{n-2}{n}, \quad b_1 = \mp\frac{2}{n}.
$$

Because of (C.6) (that is true for all $k$), we have:

$$
a_k = \pm\frac{n-2}{n}, \quad b_k = \mp\frac{2}{n}, \quad \text{for } k = 1, \ldots, n.
\tag{C.7}
$$

This means that the matrix $U$ must be of the following form or its permuted and sign-flipped versions:

$$
\frac{1}{n}
\begin{bmatrix}
n-2 & -2 & \cdots & -2 \\
-2 & n-2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & -2 \\
-2 & \cdots & -2 & n-2
\end{bmatrix}
$$

It turns out that this is symmetric, so we have $B = U$. This completes the proof of Lemma 6.6. □
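
The final matrix can be verified directly. The following sketch constructs $\frac{1}{n}\left[(n-2)\text{ on the diagonal},\ -2\text{ elsewhere}\right]$, i.e., $I - (2/n)\mathbf{1}\mathbf{1}^T$, and checks that it is symmetric and orthonormal and that its entries satisfy (C.3) and (C.6). It uses plain Python lists; the function names are illustrative.

```python
def householder(n: int):
    """Matrix with diagonal entries (n-2)/n and off-diagonal entries -2/n."""
    return [[(1.0 if i == j else 0.0) - 2.0 / n for j in range(n)] for i in range(n)]


def gram(U):
    """Return U^T U as a list of rows."""
    n = len(U)
    return [[sum(U[k][i] * U[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_deviation_from_identity(G) -> float:
    """Largest absolute entry of G - I."""
    n = len(G)
    return max(abs(G[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    for n in (2, 3, 5, 16):
        U = householder(n)
        a, b = U[0][0], U[0][1]
        print(n,
              max_deviation_from_identity(gram(U)),   # orthonormality
              a * a + (n - 1) * b * b,                # (C.3), should be 1
              2 * a + (n - 2) * b)                    # (C.6), should be 0
```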